Company:
Date Published:
Author: Yihua Zhang
Word count: 5841
Language: -
Hacker News points: None

Summary

In the realm of large language models, reinforcement learning has evolved significantly from Proximal Policy Optimization (PPO) to more advanced methods such as GRPO, DAPO, and GSPO, each addressing limitations of its predecessor. GRPO improved scalability by eliminating the dependency on a value model, but it still faced efficiency and stability challenges, particularly when handling long text outputs. DAPO refined GRPO with techniques such as Clip-Higher, Dynamic Sampling, and Token-Level Policy Gradient Loss to improve efficiency and stability, especially in MoE architectures. Even with DAPO's improvements, however, GRPO struggled to converge stably in complex scenarios. This led to the development of GSPO, which moves the optimization objective from the token level to the sequence level, reducing variance and structural noise and providing a more stable and efficient training process. GSPO's sequence-level optimization aligns more closely with the nature of the task, and it offers significant advantages when training large models, particularly those with dynamically activated experts, by avoiding dependence on routing paths and thereby enhancing stability. This evolutionary path highlights the importance of aligning reinforcement-learning objectives with the inherent nature of the task to achieve scalable, efficient, and stable model training.
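To make the token-level versus sequence-level distinction concrete, below is a minimal Python sketch. It is not taken from the article: the function names, toy numbers, and the length-normalization detail are assumptions that follow the way the GRPO and GSPO objectives are commonly written. It shows how a group-relative advantage replaces PPO's value model, and how the importance ratio changes when computed per token versus once per sequence.

```python
# Hedged sketch: GRPO-style group-relative advantages and token-level importance
# ratios versus a GSPO-style sequence-level ratio. Toy values only; names and
# shapes are illustrative assumptions, not the article's code.
import math

def group_relative_advantages(rewards):
    """Standardize each reward against its sampled group (GRPO-style),
    replacing the learned value model used by PPO."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards)) or 1.0
    return [(r - mean) / std for r in rewards]

def token_level_ratios(logp_new, logp_old):
    """One importance ratio per token (GRPO/DAPO-style); the variance of
    these ratios accumulates as the sequence gets longer."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]

def sequence_level_ratio(logp_new, logp_old):
    """A single length-normalized ratio for the whole sequence (GSPO-style),
    i.e. the geometric mean of the per-token ratios."""
    length = len(logp_new)
    return math.exp((sum(logp_new) - sum(logp_old)) / length)

# Toy example: four sampled responses to one prompt with scalar rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rewards))

# Toy per-token log-probabilities of one response under new and old policies.
logp_old = [-1.2, -0.8, -2.0, -1.5]
logp_new = [-1.1, -0.9, -1.8, -1.5]
print(token_level_ratios(logp_new, logp_old))    # one ratio per token
print(sequence_level_ratio(logp_new, logp_old))  # one ratio per sequence
```

Clipping a single sequence-level ratio, rather than many per-token ratios, is what the summary describes as reducing variance and structural noise; for MoE models it also avoids tying the update to per-token expert routing paths.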