Long-context gpt-oss Fine-tuning
Blog post from Unsloth
Unsloth has introduced Flex Attention support for training OpenAI's gpt-oss models, enabling significantly longer context lengths, lower VRAM usage, and faster training than other implementations, including Flash Attention 3. With BF16 LoRA, training now fits up to 60K context length on 80GB of VRAM, and QLoRA extends this even further. The update also resolves earlier problems such as infinite training losses on float16 GPUs.

Flex Attention provides customizable attention mechanisms, making training efficient through attention sinks and sliding-window attention. QLoRA fine-tuned gpt-oss models can now be exported to llama.cpp, vLLM, and Hugging Face, and direct fine-tuning support has been improved, which makes the models more flexible and easier to deploy. Additional bug fixes ensure consistent training loss behavior across different GPU configurations.
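To make the sink-plus-sliding-window pattern concrete, here is a minimal sketch in plain Python (not Unsloth's actual kernel; `keep_pair`, `window`, and `num_sinks` are hypothetical names for illustration). Each query position attends to a few initial "sink" tokens plus a causal sliding window of recent tokens:

```python
def keep_pair(q_idx: int, kv_idx: int, window: int = 4, num_sinks: int = 2) -> bool:
    """Return True if query position q_idx may attend to key position kv_idx."""
    causal = kv_idx <= q_idx             # never attend to future tokens
    in_window = q_idx - kv_idx < window  # recent tokens within the sliding window
    is_sink = kv_idx < num_sinks         # always keep the first few "sink" tokens
    return causal and (in_window or is_sink)

def build_mask(seq_len: int, window: int = 4, num_sinks: int = 2):
    """Materialize the boolean attention mask for a short sequence (True = attend)."""
    return [[keep_pair(q, k, window, num_sinks) for k in range(seq_len)]
            for q in range(seq_len)]

mask = build_mask(8)
```

In PyTorch's Flex Attention API, a predicate of this shape (with extra batch and head indices) is what gets passed to `create_block_mask`; the pure-Python version above just makes the sparsity pattern easy to inspect, and shows why VRAM drops: most query/key pairs are masked out and never computed.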