Accelerate RL rollouts by up to 50% with distribution-aware speculative decoding

Post Details

Company

Together AI

Date Published

April 21, 2026

Author

Together AI

Word Count

2,872

Language

English

Hacker News Points

-

Source URL

www.together.ai/blog/distribution-aware-speculative-decoding

Summary

Summary Distribution-aware speculative decoding (DAS) is an innovative framework designed to enhance the efficiency of the rollout phase in reinforcement learning (RL) post-training, offering up to a 50% speedup without affecting model outputs. This phase, crucial for models like DeepSeek-R1, has been identified as a significant bottleneck due to its long-tail nature, where a few slow generations delay the entire batch, causing GPU underutilization. DAS addresses this by employing an adaptive suffix tree drafter and a length-aware scheduling strategy, which together mitigate rollout stragglers and improve GPU load balancing. The suffix tree drafter, built from recent rollouts, continuously adapts to evolving model weights without retraining, while the scheduling strategy dynamically allocates resources based on request length. Experiments on RL tasks, such as math reasoning and code generation, demonstrate that DAS reduces rollout time significantly without compromising reward quality, making it a valuable solution for scaling RL post-training efficiently.