Company:
Date Published:
Author: Yihua Zhang
Word count: 5841
Language: -
Hacker News points: None

Summary

In the realm of large language models, reinforcement learning has evolved significantly from Proximal Policy Optimization (PPO) to more advanced methods such as GRPO, DAPO, and GSPO, each addressing limitations of its predecessor. GRPO improved scalability by eliminating the dependency on a value model, but it still faced efficiency and stability challenges, particularly when handling long text outputs. DAPO refined GRPO with techniques such as Clip-Higher, Dynamic Sampling, and Token-Level Policy Gradient Loss to improve efficiency and stability, especially in MoE architectures. Even with DAPO's improvements, however, GRPO struggled to converge stably in complex scenarios. This led to the development of GSPO, which moves the optimization objective from the token level to the sequence level, reducing variance and structural noise and providing a more stable and efficient training process. GSPO's sequence-level optimization aligns more closely with the nature of the task, and it offers significant advantages when training large models, particularly those with dynamically activated experts, by avoiding dependence on routing paths and thereby enhancing stability. This evolutionary path highlights the importance of aligning reinforcement-learning objectives with the inherent nature of the task to achieve scalable, efficient, and stable model training.
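To make the token-level versus sequence-level distinction concrete, below is a minimal Python sketch. It is not taken from the article: the function names, toy numbers, and the length-normalization detail are assumptions that follow the way the GRPO and GSPO objectives are commonly written. It shows how a group-relative advantage replaces PPO's value model, and how the importance ratio changes when computed per token versus once per sequence.

```python
# Hedged sketch: GRPO-style group-relative advantages and token-level importance
# ratios versus a GSPO-style sequence-level ratio. Toy values only; names and
# shapes are illustrative assumptions, not the article's code.
import math

def group_relative_advantages(rewards):
    """Standardize each reward against its sampled group (GRPO-style),
    replacing the learned value model used by PPO."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards)) or 1.0
    return [(r - mean) / std for r in rewards]

def token_level_ratios(logp_new, logp_old):
    """One importance ratio per token (GRPO/DAPO-style); the variance of
    these ratios accumulates as the sequence gets longer."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]

def sequence_level_ratio(logp_new, logp_old):
    """A single length-normalized ratio for the whole sequence (GSPO-style),
    i.e. the geometric mean of the per-token ratios."""
    length = len(logp_new)
    return math.exp((sum(logp_new) - sum(logp_old)) / length)

# Toy example: four sampled responses to one prompt with scalar rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rewards))

# Toy per-token log-probabilities of one response under new and old policies.
logp_old = [-1.2, -0.8, -2.0, -1.5]
logp_new = [-1.1, -0.9, -1.8, -1.5]
print(token_level_ratios(logp_new, logp_old))    # one ratio per token
print(sequence_level_ratio(logp_new, logp_old))  # one ratio per sequence
```

Clipping a single sequence-level ratio, rather than many per-token ratios, is what the summary describes as reducing variance and structural noise; for MoE models it also avoids tying the update to per-token expert routing paths.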