2.4x faster Gemma + 58% less VRAM
Blog post from Unsloth
Unsloth has announced support for Google's Gemma models, delivering significantly faster fine-tuning and inference with reduced VRAM usage compared to vanilla Hugging Face (HF) and Flash Attention 2 (FA2). On a single A100 80GB GPU, Unsloth can fit up to 40K tokens of context, exceeding the token capacity of both FA2 and vanilla HF.

The team has introduced new chat templates for more flexible dataset fine-tuning and has improved their Colab notebooks to make model customization easier. Notably, Gemma's architecture diverges from Llama and Mistral in several ways, including a much larger vocabulary and a different MLP activation function, both of which affect memory usage and performance.

Unsloth Studio (Beta) is set to launch soon, promising a streamlined fine-tuning workflow via Google Colab. The developers also highlighted a collaboration with Hugging Face on a precision issue affecting RoPE embeddings computed in bfloat16, which they report is now fixed in Unsloth.

Despite being a small team, they continue to ship model updates and optimizations, and they encourage community support through donations and participation on platforms like Discord and Twitter.
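For context on the chat templates mentioned above: a chat template maps a list of role-tagged messages into the single formatted string the model was trained on. Below is a minimal sketch in a Gemma-style turn format; the `<start_of_turn>`/`<end_of_turn>` markers follow Gemma's published prompt format, but the authoritative template ships with the tokenizer (applied via `tokenizer.apply_chat_template`), so treat this as illustrative only.

```python
def format_gemma_chat(messages: list[dict]) -> str:
    """Render chat messages in a Gemma-style turn format (illustrative sketch).

    The real template is bundled with the tokenizer; this only shows the shape.
    """
    out = "<bos>"
    for msg in messages:
        # Gemma uses "model" rather than "assistant" as the responder role
        role = "model" if msg["role"] == "assistant" else msg["role"]
        out += f"<start_of_turn>{role}\n{msg['content']}<end_of_turn>\n"
    # Leave an open model turn so generation continues from here
    return out + "<start_of_turn>model\n"

prompt = format_gemma_chat([
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there."},
    {"role": "user", "content": "Tell me a joke."},
])
```

Fine-tuning on data rendered this way keeps training and inference prompts consistent, which is the point of shipping templates alongside the notebooks.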
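On the activation-function difference: Gemma's MLP uses a tanh-approximate GeLU where Llama and Mistral use SiLU. A quick sketch of the two, using the standard formulas (an illustration of the difference, not Unsloth's kernel code):

```python
import math

def silu(x: float) -> float:
    # SiLU (swish): x * sigmoid(x), the MLP activation in Llama and Mistral
    return x / (1.0 + math.exp(-x))

def gelu_tanh(x: float) -> float:
    # tanh-approximate GeLU, the variant used in Gemma's MLP
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))
```

The two curves are close but not identical, so fused kernels written for SiLU cannot simply be reused for Gemma; this is one of the architecture details an optimized implementation has to handle explicitly.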
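The RoPE precision issue comes down to bfloat16's 7-bit mantissa: integer position ids above 256 can no longer all be represented exactly, so rotary angles computed in bfloat16 drift at long context. A minimal demonstration, emulating bfloat16 by rounding a float32 to its top 16 bits (this illustrates the failure mode; it is not Unsloth's actual patch, which amounts to keeping the computation in higher precision):

```python
import struct

def to_bfloat16(x: float) -> float:
    """Round a float to bfloat16 precision (round-to-nearest-even) and back."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # round to nearest, ties to even
    bits &= 0xFFFF0000                    # keep sign + exponent + 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# Small positions survive; large ones collapse onto neighbours:
print(to_bfloat16(100.0))   # 100.0 -- exact
print(to_bfloat16(1001.0))  # 1000.0 -- position 1001 is indistinguishable from 1000
```

Since RoPE multiplies these positions by per-dimension frequencies, collapsed position ids translate directly into wrong rotation angles, which is why computing the embeddings in float32 matters.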