Accelerate RL rollouts by up to 50% with distribution-aware speculative decoding
Blog post from Together AI
Summary

Distribution-aware speculative decoding (DAS) is a framework for speeding up the rollout phase of reinforcement learning (RL) post-training, with a speedup of up to 50% and no change to model outputs. The rollout phase, which is central to training models like DeepSeek-R1, is a significant bottleneck because generation lengths are long-tailed: a few slow generations delay the entire batch and leave GPUs underutilized. DAS addresses this with two components, an adaptive suffix tree drafter and a length-aware scheduling strategy, which together mitigate rollout stragglers and improve GPU load balancing. The suffix tree drafter is built from recent rollouts, so it keeps adapting to the evolving model weights without any retraining, while the scheduling strategy dynamically allocates resources based on request length. Experiments on RL tasks such as math reasoning and code generation show that DAS significantly reduces rollout time without compromising reward quality, which makes it a practical way to scale RL post-training.
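To make the drafter idea concrete, here is a minimal sketch of drafting from recent rollouts and verifying against the target model. It is an illustration under stated assumptions, not the DAS implementation: `SuffixDrafter`, `speculative_generate`, and `toy_next_token` are hypothetical names, a dictionary of n-gram contexts stands in for a real suffix tree, the "model" is a toy deterministic function, and decoding is assumed to be greedy (sampling would need a rejection-sampling style verifier instead of an equality check).

```python
from collections import defaultdict


class SuffixDrafter:
    """Hypothetical stand-in for a suffix tree: indexes every n-gram context
    in recent rollouts so the tokens that followed it can be proposed."""

    def __init__(self, max_context=4):
        self.max_context = max_context
        self.rollouts = []
        self.index = defaultdict(list)  # context tuple -> [(rollout id, position)]

    def add_rollout(self, tokens):
        """Index a finished rollout; no training step is involved."""
        rid = len(self.rollouts)
        self.rollouts.append(list(tokens))
        for end in range(1, len(tokens)):
            for n in range(1, self.max_context + 1):
                if end - n < 0:
                    break
                self.index[tuple(tokens[end - n:end])].append((rid, end))

    def draft(self, prefix, max_draft):
        """Propose tokens that followed the longest matching suffix of prefix."""
        for n in range(min(self.max_context, len(prefix)), 0, -1):
            hits = self.index.get(tuple(prefix[-n:]))
            if hits:
                rid, pos = hits[-1]  # prefer the most recent occurrence
                return self.rollouts[rid][pos:pos + max_draft]
        return []


def speculative_generate(next_token, prompt, max_new, drafter=None, max_draft=4):
    """Greedy decoding with optional drafting.

    Returns (new tokens, number of verification passes). Each loop iteration
    models one target forward pass that checks a whole draft at once.
    """
    out = list(prompt)
    target_len = len(prompt) + max_new
    passes = 0
    while len(out) < target_len:
        draft = drafter.draft(out, max_draft) if drafter else []
        passes += 1
        for tok in draft:
            # Accept a drafted token only if the target would have emitted it.
            if len(out) >= target_len or next_token(out) != tok:
                break
            out.append(tok)
        if len(out) < target_len:
            # The target's own token: a correction after a mismatch,
            # or a bonus token when the whole draft was accepted.
            out.append(next_token(out))
    return out[len(prompt):], passes


def toy_next_token(prefix):
    """Toy deterministic 'model' (an assumption made for this demo only)."""
    return (prefix[-1] * 3 + prefix[-2] + 1) % 11


prompt = [1, 2]
baseline, base_passes = speculative_generate(toy_next_token, prompt, 40)

drafter = SuffixDrafter()
drafter.add_rollout(prompt + baseline)  # a "recent rollout" seeds the drafter
spec, spec_passes = speculative_generate(toy_next_token, prompt, 40, drafter)

print("identical output:", spec == baseline)
print("verification passes:", base_passes, "->", spec_passes)
```

Because a drafted token is kept only when it equals the token the target would have produced, the output matches plain greedy decoding no matter how good or bad the drafts are. The drafter only changes how many tokens each verification pass yields, which is why refreshing it from recent rollouts can track the changing policy without a separate training step.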