
Long-context GRPO

Blog post from Unsloth

Post Details
Company: Unsloth
Date Published: -
Author: Daniel & Michael
Word Count: 1,863
Language: English
Hacker News Points: -
Summary

Unsloth has released an efficient GRPO algorithm that dramatically reduces the VRAM required for long-context language model training, using up to 90% less VRAM than standard GRPO implementations. With this advance, a model like Llama 3.1 can be trained at a 20K context length with just 54.3GB of VRAM, down from the 510.8GB previously required. The Unsloth algorithm combines several memory-saving techniques, including gradient checkpointing and intermediate gradient accumulation, while maintaining nearly the same training speed. Unsloth also provides full logging of reward function outputs and supports dynamic quantization for improved accuracy. In addition, the team introduced new vLLM integration capabilities that enable fast inference and reduce KV cache memory usage on modern GPUs, and shared updates on their appearance at GitHub Universe and recent collaborations.
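
To illustrate how such a training run is typically wired up, below is a minimal sketch that pairs Unsloth's FastLanguageModel with TRL's GRPOTrainer. The model name, dataset, LoRA rank, reward function, and sequence lengths are illustrative assumptions, not values taken from the post.

```python
# Minimal sketch (not from the post) of long-context GRPO fine-tuning with
# Unsloth + TRL. Model, dataset, and hyperparameters below are assumptions.
from unsloth import FastLanguageModel
from trl import GRPOConfig, GRPOTrainer
from datasets import load_dataset

max_seq_length = 20480  # illustrative long-context setting (~20K tokens)

# Load a 4-bit model with Unsloth's fast (vLLM-backed) inference path enabled.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Llama-3.1-8B-Instruct",  # assumed example model
    max_seq_length=max_seq_length,
    load_in_4bit=True,            # quantized weights to cut VRAM
    fast_inference=True,          # vLLM-backed rollout generation
    max_lora_rank=32,
    gpu_memory_utilization=0.6,   # share GPU memory between training and vLLM
)

# Attach LoRA adapters; "unsloth" gradient checkpointing offloads activations.
model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)

def length_reward(completions, **kwargs):
    """Toy reward: favor longer completions; scores are logged per step."""
    return [min(len(c) / 1000.0, 1.0) for c in completions]

training_args = GRPOConfig(
    use_vllm=True,                 # generate rollouts with vLLM
    learning_rate=5e-6,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    num_generations=8,             # completions sampled per prompt
    max_prompt_length=1024,
    max_completion_length=max_seq_length - 1024,
    max_steps=100,
    output_dir="outputs",
)

dataset = load_dataset("trl-lib/tldr", split="train")  # assumed example dataset

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[length_reward],
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```

In this sketch, `use_gradient_checkpointing="unsloth"` and `fast_inference=True` are the knobs that correspond to the memory-saving and vLLM-backed inference features described in the summary; any reward function passed to `reward_funcs` has its scores logged during training.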