DFlash: 3x faster LLM inference
Blog post from Baseten
Speculative decoding (SpecDec) has become a prominent technique for reducing the latency of large language models (LLMs): a smaller draft model proposes tokens that the target model then verifies, typically yielding speedups of around 2x. EAGLE, a popular SpecDec method, uses the hidden states of the target model to predict draft tokens, but because it drafts autoregressively, one token at a time, its speedups are usually capped at roughly 2x.

DFlash was introduced to overcome this autoregressive bottleneck. Instead of drafting token by token, it predicts multiple tokens in a single forward pass using bidirectional attention. Like EAGLE, DFlash leverages the target model's hidden states, which allows for deeper draft models and higher-quality speculative drafts without sacrificing speed.

In practice, Baseten's implementation of DFlash delivers a 3x speedup over the non-speculative baseline. Across benchmarks including GSM8k, MATH-500, and NVIDIA's Nemotron post-training dataset, it consistently outperforms both EAGLE and the DFlash implementations in vLLM and SGLang in throughput, latency, and draft accuracy, bridging the output quality of autoregressive decoding with the speed of diffusion LLMs.
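To make the draft-and-verify loop concrete, here is a minimal Python sketch of greedy speculative decoding. The `draft_next_token` and `target_next_token` functions are hypothetical toy stand-ins for real models, and the greedy acceptance rule shown is one common SpecDec variant, not Baseten's exact implementation.

```python
import random

# Hypothetical toy "models": each maps a token sequence to a
# next-token prediction. Stand-ins for real draft/target LLMs.
def draft_next_token(tokens: list[int]) -> int:
    random.seed(sum(tokens))       # deterministic toy behavior
    return random.randrange(0, 100)

def target_next_token(tokens: list[int]) -> int:
    random.seed(sum(tokens) % 97)  # deliberately similar, not identical
    return random.randrange(0, 100)

def speculative_step(tokens: list[int], k: int = 4) -> list[int]:
    """One draft-and-verify round with greedy acceptance."""
    # 1. The draft model proposes k tokens autoregressively (cheap).
    draft, ctx = [], list(tokens)
    for _ in range(k):
        t = draft_next_token(ctx)
        draft.append(t)
        ctx.append(t)

    # 2. The target model verifies the draft. In a real system it
    #    scores all k positions in a single forward pass; here we
    #    simulate that position by position.
    accepted, ctx = [], list(tokens)
    for t in draft:
        expected = target_next_token(ctx)
        if expected != t:
            # First mismatch: keep the target's own token and stop.
            accepted.append(expected)
            break
        accepted.append(t)
        ctx.append(t)
    else:
        # All k drafts accepted; the target contributes a bonus token.
        accepted.append(target_next_token(ctx))
    return tokens + accepted

tokens = [1, 2, 3]
for _ in range(3):
    tokens = speculative_step(tokens)
print(tokens)
```

Each round emits between one and k+1 tokens for a single target pass, which is where the latency win comes from: the more draft tokens the target accepts, the fewer expensive forward passes per generated token.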
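The difference between EAGLE-style and DFlash-style drafting shows up most clearly in the attention mask. The PyTorch sketch below contrasts a causal mask (draft tokens produced one at a time) with a block-bidirectional mask in which the k draft slots attend to the full context and to each other, so one forward pass can fill all k positions. The exact mask layout in DFlash is our assumption based on the description above, not a confirmed implementation detail.

```python
import torch

def causal_mask(n: int) -> torch.Tensor:
    # Autoregressive (EAGLE-style) drafting: position i attends only
    # to positions <= i, so producing k draft tokens takes k passes.
    return torch.tril(torch.ones(n, n)).bool()

def block_bidirectional_mask(ctx_len: int, k: int) -> torch.Tensor:
    # Single-pass (DFlash-style) drafting, as we read the post: the
    # context stays causal, but the k draft slots attend to the whole
    # context and to each other, so all k tokens come from one pass.
    # The exact layout here is an assumption, not a confirmed detail.
    n = ctx_len + k
    mask = torch.tril(torch.ones(n, n)).bool()
    mask[ctx_len:, ctx_len:] = True  # draft block is fully bidirectional
    return mask

print(causal_mask(4).int())                            # lower triangular (incl. diagonal)
print(block_bidirectional_mask(ctx_len=3, k=4).int())  # last 4 rows/cols fully connected
```

With `k=4`, the second mask's last four rows have ones in all of the last four columns: every draft slot sees every other draft slot, which is what lets a bidirectional drafter fill the whole block in a single forward pass instead of k sequential ones.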