2.4x faster Gemma + 58% less VRAM
Blog post from Unsloth
Unsloth has announced support for Google's Gemma models, delivering significantly faster fine-tuning and inference with reduced VRAM usage compared to vanilla Hugging Face (HF) and Flash Attention 2 (FA2). On a single A100 80GB GPU, Unsloth can fit up to 40K tokens of context, exceeding the token capacity of both FA2 and vanilla HF.

The team has introduced new chat templates for more flexible dataset fine-tuning and has improved their Colab notebooks to make model customization easier. Notably, Gemma's architecture diverges from Llama and Mistral in several ways, including a much larger vocabulary and a different MLP activation function, both of which affect memory usage and performance.

Unsloth Studio (Beta) is set to launch soon, promising a streamlined fine-tuning workflow via Google Colab. The developers also highlighted a collaboration with Hugging Face on a precision issue affecting RoPE embeddings computed in bfloat16, which they report is now fixed in Unsloth.

Despite being a small team, they continue to ship model updates and optimizations, and they encourage community support through donations and participation on platforms like Discord and Twitter.
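For context on the chat templates mentioned above: a chat template maps a list of role-tagged messages into the single formatted string the model was trained on. Below is a minimal sketch in a Gemma-style turn format; the `<start_of_turn>`/`<end_of_turn>` markers follow Gemma's published prompt format, but the authoritative template ships with the tokenizer (applied via `tokenizer.apply_chat_template`), so treat this as illustrative only.

```python
def format_gemma_chat(messages: list[dict]) -> str:
    """Render chat messages in a Gemma-style turn format (illustrative sketch).

    The real template is bundled with the tokenizer; this only shows the shape.
    """
    out = "<bos>"
    for msg in messages:
        # Gemma uses "model" rather than "assistant" as the responder role
        role = "model" if msg["role"] == "assistant" else msg["role"]
        out += f"<start_of_turn>{role}\n{msg['content']}<end_of_turn>\n"
    # Leave an open model turn so generation continues from here
    return out + "<start_of_turn>model\n"

prompt = format_gemma_chat([
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there."},
    {"role": "user", "content": "Tell me a joke."},
])
```

Fine-tuning on data rendered this way keeps training and inference prompts consistent, which is the point of shipping templates alongside the notebooks.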
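On the activation-function difference: Gemma's MLP uses a tanh-approximate GeLU where Llama and Mistral use SiLU. A quick sketch of the two, using the standard formulas (an illustration of the difference, not Unsloth's kernel code):

```python
import math

def silu(x: float) -> float:
    # SiLU (swish): x * sigmoid(x), the MLP activation in Llama and Mistral
    return x / (1.0 + math.exp(-x))

def gelu_tanh(x: float) -> float:
    # tanh-approximate GeLU, the variant used in Gemma's MLP
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))
```

The two curves are close but not identical, so fused kernels written for SiLU cannot simply be reused for Gemma; this is one of the architecture details an optimized implementation has to handle explicitly.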
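The RoPE precision issue comes down to bfloat16's 7-bit mantissa: integer position ids above 256 can no longer all be represented exactly, so rotary angles computed in bfloat16 drift at long context. A minimal demonstration, emulating bfloat16 by rounding a float32 to its top 16 bits (this illustrates the failure mode; it is not Unsloth's actual patch, which amounts to keeping the computation in higher precision):

```python
import struct

def to_bfloat16(x: float) -> float:
    """Round a float to bfloat16 precision (round-to-nearest-even) and back."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # round to nearest, ties to even
    bits &= 0xFFFF0000                    # keep sign + exponent + 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# Small positions survive; large ones collapse onto neighbours:
print(to_bfloat16(100.0))   # 100.0 -- exact
print(to_bfloat16(1001.0))  # 1000.0 -- position 1001 is indistinguishable from 1000
```

Since RoPE multiplies these positions by per-dimension frequencies, collapsed position ids translate directly into wrong rotation angles, which is why computing the embeddings in float32 matters.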