
Finetune Llama 3 - 2x faster + 6x longer context + 68% less VRAM

Blog post from Unsloth

Post Details
Company: Unsloth
Author: Daniel & Michael
Word Count: 891
Language: English
Summary

Unsloth has released updates that make fine-tuning Llama 3 significantly faster and more memory-efficient. Llama 3 (8B) can now be fine-tuned 2x faster with 63% less memory than Hugging Face with Flash Attention 2 (FA2), and Llama 3 (70B) trains 1.8x faster with a 68% reduction in VRAM. These savings also enable much longer context lengths: Llama-3 70B fits 48,000 tokens with Unsloth versus 7,000 without. A Colab notebook is available for fine-tuning the 8B model on a free Tesla T4 GPU, and pre-quantized models are provided for faster downloads. The update also addresses quirks such as the Llama 3 tokenizer not adding the BOS token, and untrained tokens in the base model that can cause problems during fine-tuning. The community is encouraged to support and engage with Unsloth via Discord and Twitter, and future updates such as Phi 3 support are planned.
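One of the quirks mentioned above is the Llama 3 tokenizer not prepending the BOS token (`<|begin_of_text|>`, token id 128000) to inputs, which fine-tuning pipelines typically expect. A minimal sketch of a guard against this, assuming a hypothetical helper name (`ensure_bos`) and hard-coded example token ids rather than Unsloth's actual fix:

```python
def ensure_bos(token_ids, bos_id):
    """Prepend the BOS token id if the tokenizer did not add it."""
    if not token_ids or token_ids[0] != bos_id:
        return [bos_id] + token_ids
    return token_ids

# Llama 3's <|begin_of_text|> has token id 128000.
BOS_ID = 128000

# Example tokenizer output that is missing the BOS token.
ids = [9906, 1917]
print(ensure_bos(ids, BOS_ID))  # [128000, 9906, 1917]
```

A guard like this is idempotent: calling it on a sequence that already starts with BOS leaves it unchanged, so it can be applied safely to every example in a dataset.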