Speculative Decoding in Practice: How EAGLE3 Makes LLMs Faster Without Changing Their Outputs
Blog post from HuggingFace
The article explores speculative decoding in detail and discusses how Thoughtworks' EAGLE3 model accelerates large language model (LLM) inference by putting otherwise idle GPU compute to work without altering the output distribution. The method uses two models: a smaller draft model proposes several candidate tokens, and the main model verifies them in parallel in a single forward pass, so the emitted tokens follow the same distribution the main model would have produced on its own. The EAGLE family of models improves on this by training a draft head conditioned on the main model's internal representations rather than relying on an independent draft model, which leads to significant speed improvements. EAGLE3's tri-layer feature fusion feeds the draft head features from three layers of the main model, covering multiple levels of abstraction, and results in a reported 4.1–6.5× speedup on specific benchmarks. The article also emphasizes validating speculative decoding through extensive benchmarking, addressing challenges with mixture-of-experts architectures, and keeping acceptance rates high, since rejected draft tokens are wasted work and a low acceptance rate can erase the benefit. Thoughtworks also maintains custom forks to support its models, contributing to inference optimization efforts in the broader machine learning community.
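To make the draft-and-verify step concrete, here is a minimal sketch of the standard speculative sampling acceptance rule for a single chain of drafted tokens. It is not taken from the article or from any EAGLE3 codebase: the function names `verify_draft` and `expected_tokens_per_step` are hypothetical, the distributions are plain Python lists rather than model outputs, and EAGLE-style tree drafting is not modeled. The second helper uses the common simplifying assumption that each drafted token is accepted independently with the same probability.

```python
import random


def verify_draft(draft_tokens, draft_probs, target_probs, rng):
    """Accept or reject drafted tokens with the standard speculative sampling rule.

    draft_tokens: k token ids proposed by the draft model
    draft_probs:  k distributions (lists over the vocabulary) the draft sampled from
    target_probs: k + 1 distributions from the main model, one per drafted
                  position plus one extra for a bonus token
    Returns the tokens to emit this step (between 1 and k + 1 of them).
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        q, p = draft_probs[i], target_probs[i]
        # Accept the drafted token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            continue
        # Rejected: resample from the residual max(p - q, 0), then stop.
        residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
        out.append(rng.choices(range(len(p)), weights=residual)[0])
        return out
    # Every draft was accepted: sample one bonus token from the main model.
    last = target_probs[-1]
    out.append(rng.choices(range(len(last)), weights=last)[0])
    return out


def expected_tokens_per_step(alpha, k):
    """Expected tokens emitted per verification pass, assuming each of the k
    drafted tokens is accepted independently with probability alpha."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)
```

Under that independence assumption, the second function shows why acceptance rate matters so much: with `k = 4` drafted tokens, moving `alpha` from 0.5 to 0.9 roughly doubles the tokens emitted per main-model pass, while a low `alpha` leaves the draft work mostly wasted.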