Keep the Tokens Flowing: Lessons from 16 Open-Source RL Libraries
Blog post from Hugging Face
The article surveys asynchronous reinforcement learning (RL) training practices, starting from the inefficiency of synchronous RL, where data generation dominates wall-clock time while training GPUs sit idle. It recommends disaggregating inference and training onto separate GPU pools, connected by a rollout buffer, so that the two run in parallel and wait on each other as little as possible. A survey of 16 open-source RL libraries identifies Ray as the dominant orchestration tool and the NVIDIA Collective Communications Library (NCCL) as the standard for weight synchronization. The analysis compares design strategies across seven axes, including orchestration, buffer design, weight sync, staleness management, and support for LoRA (Low-Rank Adaptation) training. The article also examines emerging trends and challenges, such as critic-free algorithms, process rewards, multi-agent co-evolution, and MoE (Mixture of Experts) models, and stresses the need for adaptable infrastructure. It concludes with a call for lightweight orchestration and lays out design choices for an asynchronous trainer in the TRL library: a bounded queue with per-token model versioning, efficient NCCL weight synchronization, and strategies for handling partial rollouts in complex tasks.
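The bounded queue with per-token model versioning mentioned above can be illustrated with a minimal sketch. This is not TRL's implementation: the names `Rollout`, `RolloutBuffer`, and `max_staleness` are hypothetical, and dropping rollouts that are too stale is just one assumed staleness policy among several possible ones. The sketch only shows the idea that a full queue applies backpressure to generation and that each token records the policy version that produced it, so a partial rollout spanning a weight update stays traceable.

```python
import queue
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Rollout:
    tokens: List[int]
    # Policy version that generated each token (hypothetical field).
    # A partial rollout resumed after a weight sync mixes versions.
    token_versions: List[int]


class RolloutBuffer:
    """Bounded FIFO between inference workers and the trainer (illustrative)."""

    def __init__(self, capacity: int, max_staleness: int):
        self._q: "queue.Queue[Rollout]" = queue.Queue(maxsize=capacity)
        self.max_staleness = max_staleness

    def put(self, rollout: Rollout, timeout: Optional[float] = None) -> None:
        # Blocks while the queue is full, which throttles generation.
        # Raises queue.Full if a timeout is given and it expires.
        self._q.put(rollout, timeout=timeout)

    def get_fresh(self, trainer_version: int,
                  timeout: Optional[float] = None) -> Rollout:
        # Skip rollouts whose oldest token lags the trainer by more than
        # max_staleness versions. Raises queue.Empty on timeout.
        while True:
            rollout = self._q.get(timeout=timeout)
            oldest = min(rollout.token_versions)
            if trainer_version - oldest <= self.max_staleness:
                return rollout


buffer = RolloutBuffer(capacity=2, max_staleness=1)
buffer.put(Rollout(tokens=[1, 2, 3], token_versions=[0, 0, 1]))  # spans a sync
buffer.put(Rollout(tokens=[4, 5], token_versions=[2, 2]))
fresh = buffer.get_fresh(trainer_version=3, timeout=0.1)
print(fresh.tokens)
```

In this usage example the first rollout is discarded because its oldest token is three versions behind the trainer, and the second is returned. In a real system the producer and consumer would run in separate processes or GPU pools rather than in one thread.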