Enabling Large-Scale RLHF of GPT-OSS with the Megatron backend in VeRL
Blog post from Hugging Face
This document summarizes work in the VeRL community on large-scale reinforcement learning from human feedback (RLHF) of the GPT-OSS model using the Megatron backend. The experiments showed linear scaling when post-training GPT-OSS-20B with the GRPO algorithm across a large number of GPUs, substantially reducing training time and cost. Different data types, BF16 and FP8, were explored for post-training efficiency, and a proprietary Slurm-based post-training platform was extended to support these capabilities.

The document also touches on other models, such as Qwen3-Next-Coder and Step-3_5-Flash, noting their suitability for agentic workflows through enhanced attention mechanisms, and describes GPT-OSS-120B as competitive in speed and in ranking among non-proprietary models. The system's design decouples inference from training to make better use of resources, and the integration of backend technologies such as vLLM/SGLang for inference and Megatron for training is discussed in the context of optimizing the combined training and inference pipeline.
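The GRPO algorithm mentioned above scores each rollout relative to the other rollouts sampled for the same prompt, normalizing rewards by the group's mean and standard deviation. A minimal sketch of that group-relative advantage computation (illustrative only, not VeRL's actual implementation; the function and variable names are assumptions):

```python
from statistics import mean, stdev

def grpo_advantages(group_rewards, eps=1e-6):
    """Group-relative advantages: normalize each rollout's reward by the
    mean and standard deviation of its prompt's rollout group."""
    mu = mean(group_rewards)
    sigma = stdev(group_rewards) if len(group_rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in group_rewards]

# Example: rewards for four rollouts sampled from the same prompt.
advs = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

Because the baseline is the group mean rather than a learned value function, no critic model is needed, which is part of what makes GRPO attractive for post-training at scale.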