Continued Pretraining with Unsloth
Blog post from Unsloth
Unsloth has announced a new release that, it claims, makes continued pretraining of large language models (LLMs) twice as fast while using 50% less VRAM compared to Hugging Face with Flash Attention 2 QLoRA. The release includes a free Colab notebook for continually pretraining a model such as Mistral v0.3 7B to learn a new language like Korean, and the accompanying post walks through practical choices such as finetuning the input and output embeddings and giving them their own learning rate to stabilize training.

To address issues identified in the "LoRA Learns Less and Forgets Less" paper, Unsloth advocates training all linear layers, including the gate projection matrix, as well as the lm_head and embed_tokens matrices, and suggests rank-stabilized LoRA (rsLoRA) for better results. The post also emphasizes decoupled learning rates, applying a separate, smaller rate to the embeddings than to the LoRA weights, and provides UnslothTrainer and UnslothTrainingArguments to make this easy, reporting a noticeably lower training loss with this setup.
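The adapter setup described above can be sketched roughly as follows, assuming the unsloth Python API (FastLanguageModel and get_peft_model, which the summary does not name explicitly). The model identifier, rank, and alpha values are illustrative assumptions rather than settings quoted from the post; the point of the sketch is the target_modules list covering all linear layers plus embed_tokens and lm_head, and the rsLoRA flag.

```python
# Sketch of the adapter configuration the post describes: LoRA on all linear
# layers (including gate_proj) plus embed_tokens and lm_head, with rsLoRA.
# Model name and hyperparameters are illustrative assumptions.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/mistral-7b-v0.3",  # assumed identifier for Mistral v0.3 7B
    max_seq_length=2048,
    load_in_4bit=True,                     # QLoRA-style 4-bit loading
)

model = FastLanguageModel.get_peft_model(
    model,
    r=128,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # MLP projections, incl. the gate matrix
        "embed_tokens", "lm_head",                # input and output embeddings
    ],
    lora_alpha=32,
    use_rslora=True,                              # rank-stabilized LoRA
)
```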
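The decoupled learning rates could then look something like the sketch below, which continues from the snippet above. It assumes UnslothTrainingArguments accepts an embedding_learning_rate argument for the separate embedding rate (the summary names the classes but not their parameters); the dataset, step count, and specific rates are placeholders, with the embedding rate simply set lower than the main LoRA rate.

```python
# Sketch of decoupled learning rates with UnslothTrainer / UnslothTrainingArguments.
# Dataset choice and hyperparameters are placeholders; the key point is that the
# embedding rate (embed_tokens / lm_head) is smaller than the main learning_rate.
from datasets import load_dataset
from unsloth import UnslothTrainer, UnslothTrainingArguments

# Placeholder corpus for the "learn Korean" example: a slice of Korean Wikipedia.
dataset = load_dataset("wikimedia/wikipedia", "20231101.ko", split="train[:1%]")

trainer = UnslothTrainer(
    model=model,                       # model and tokenizer from the snippet above
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=UnslothTrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        max_steps=120,
        learning_rate=5e-5,            # rate for the LoRA matrices
        embedding_learning_rate=5e-6,  # smaller, decoupled rate for the embeddings
        output_dir="outputs",
    ),
)
trainer.train()
```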