We use speculative decoding to improve the speed and cost of our serverless and dedicated inference endpoints. In this technique, a smaller, faster "speculator" model drafts the next few tokens, which the larger target model then verifies in parallel. A strong speculator has two key properties: it is fast, and it is well aligned with the target model. Fine-tuning speculators on specific domains of interest improves that alignment and therefore yields higher speedups.

Our state-of-the-art Base Speculator already provides 1.44-2.27x speedups over conventional next-token prediction for DeepSeek-R1 inference workloads. Customizing the speculator to a workload yields an additional 1.23-1.45x speedup, for a total speedup of 1.85-2.97x, and cuts overall cost by roughly 25% relative to the Base Speculator. Because custom speculators also increase throughput per GPU, they lower overall inference costs: training Custom Speculators on data from a specific workload reduces the GPU hours needed to generate 1B tokens by 23-26%. These results demonstrate the effectiveness of our speculative decoding stack and the benefits of customizing speculators for individual workloads.
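To make the draft-then-verify loop concrete, here is a minimal greedy speculative-decoding sketch in Python. The `draft_next` and `target_next` functions are hypothetical toy stand-ins for the speculator and target models (they are not our production speculators or DeepSeek-R1); the sketch only illustrates the control flow of drafting k tokens cheaply and then checking them against the target.

```python
# Minimal sketch of greedy speculative decoding with toy stand-in models.
# draft_next / target_next are hypothetical placeholders, not real models.
from typing import List

VOCAB_SIZE = 100

def draft_next(ctx: List[int]) -> int:
    """Toy speculator: a cheap deterministic guess for the next token."""
    return (sum(ctx) * 31 + 7) % VOCAB_SIZE

def target_next(ctx: List[int]) -> int:
    """Toy target model: usually agrees with the speculator, sometimes diverges."""
    s = sum(ctx)
    if s % 3 == 0:
        return (s + 1) % VOCAB_SIZE       # divergence case: draft will be rejected
    return (s * 31 + 7) % VOCAB_SIZE      # agreement case: draft will be accepted

def speculative_decode(prompt: List[int], max_new: int = 32, k: int = 4) -> List[int]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # 1) Speculator drafts k tokens autoregressively (cheap).
        draft, ctx = [], list(out)
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # 2) Target model checks each drafted position. A real system would
        #    score all k positions in one batched forward pass.
        accepted = 0
        for i in range(k):
            expected = target_next(out + draft[:i])
            if draft[i] == expected:
                accepted += 1
            else:
                # First mismatch: keep the accepted prefix plus the target's token.
                out.extend(draft[:accepted])
                out.append(expected)
                break
        else:
            # Every drafted token matched; keep them all plus one bonus token.
            out.extend(draft)
            out.append(target_next(out))
    return out[: len(prompt) + max_new]

print(speculative_decode([1, 2, 3], max_new=12, k=4))
```

In a real deployment the verification step is a single parallel forward pass of the large model over all drafted positions, so each accepted draft token replaces a full, expensive autoregressive step; the better the speculator's guesses match the target's outputs on a given workload, the more tokens are accepted per pass and the larger the speedup.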