Live draft model training for speculative decoding

Post Details

Company

Baseten

Date Published

June 25, 2026

Author

Chloe Florit

Word Count

808

Company Posts That Month

13

Language

English

Hacker News Points

-

Source URL

www.baseten.co/blog/live-draft-model-training-for-speculative-decoding

Summary

Draft models such as EAGLE-3 and DFlash are increasingly used in speculative decoding to raise large language model (LLM) inference throughput and reduce latency, but keeping a draft model aligned with its base model and with shifting traffic patterns is difficult. Baseten's answer is a distributed training pipeline that extracts hidden states from live inference and trains draft models in real time, which removes the need to store training data offline. The post reports a median 20% increase in accept rates, with some traffic patterns improving by more than 100%, which translates into faster speculative decoding and more efficient workloads. The architecture is integrated into the Baseten Inference Stack and operates with minimal overhead, relying on a highly optimized inference engine and careful handling of GPU execution, memory management, and networking. It uses UCXX for networking and Trio for concurrency management, and is designed to stay resilient to hardware failures and network disruptions.
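
To make the link between accept rate and decoding speed concrete, here is a minimal sketch using the standard speculative-decoding estimate of tokens emitted per verification step. It is not taken from the post: the function name, the draft length of 4, and the baseline accept rate of 0.5 are illustrative assumptions, and the formula assumes each drafted token is accepted independently with the same probability.

```python
def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target-model verification step.

    Assumption (a simplification): each drafted token is accepted
    independently with probability accept_rate. The target model always
    contributes one token per step, so the result is at least 1.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if accept_rate == 1.0:
        # Every drafted token is accepted, plus the target model's own token.
        return float(draft_len + 1)
    # Geometric series: 1 + a + a^2 + ... + a^draft_len
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


# Hypothetical numbers: a baseline accept rate of 0.5 and a 20% relative
# improvement to 0.6, with 4 drafted tokens per step.
baseline = expected_tokens_per_step(0.5, 4)
improved = expected_tokens_per_step(0.6, 4)
print(f"baseline: {baseline:.4f} tokens/step")
print(f"improved: {improved:.4f} tokens/step")
```

Under these assumptions the estimate rises from about 1.94 to about 2.31 tokens per verification step. Real accept rates vary by token position and traffic pattern, so this is only a rough guide to how an accept-rate gain compounds across a draft.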

Trends Found in this Post

Trend	Post Mentions	Total Month Mentions	Posts	Companies	MoM
LLM	3	5,172	1,006	220	-43%
AI Model Fine-tuning	1	694	169	62	+13%
Real-time	1	5,457	1,338	238	-5%
Reinforcement learning	1	59	31	19	-34%