Author: Tianle Cai*, Yuhong Li*, Zhengyang Geng, Hongwu Peng, Tri Dao (* equal contribution)
Word count: 2817
Language: English

Summary

Medusa is a new framework for accelerating Large Language Model (LLM) generation that offers a simpler alternative to speculative decoding. By adding multiple decoding heads on top of the original model, each predicting a token several positions ahead, Medusa improves generation efficiency by roughly 2x. A tree-based attention mechanism lets the model verify multiple candidate continuations in parallel, increasing computational throughput and reducing the need for importance sampling. Medusa's typical acceptance scheme also offers an efficient way to generate creative output while remaining robust across varying sampling temperatures. In tests on Vicuna models, Medusa achieved about a 2x wall-time speedup across a range of use cases, making it an attractive option for accelerating LLM generation and serving.
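To make the candidate-verification idea concrete, here is a minimal, purely illustrative Python sketch of the two steps the summary describes: each extra decoding head proposes top-k guesses for a future token position, the guesses are expanded into candidate continuations (the paths of the candidate tree that tree attention scores in one pass), and the longest continuation the base model agrees with is accepted. The function names (`build_candidates`, `accept_longest_prefix`, `toy_verify`) and the greedy acceptance rule are assumptions for illustration, not Medusa's actual implementation or its typical acceptance scheme.

```python
from itertools import product

def build_candidates(head_topk):
    # head_topk[k] holds the top candidate tokens proposed by Medusa
    # head k (head k guesses the token k+1 steps ahead of the current one).
    # The Cartesian product enumerates every candidate continuation — the
    # paths of the tree that tree attention would verify in one forward pass.
    return list(product(*head_topk))

def accept_longest_prefix(candidates, verify):
    # verify(prefix) returns the token the base model would emit after
    # `prefix`. Greedy acceptance: extend a candidate while the base model
    # agrees, so several tokens can be accepted per verification step —
    # the source of the speedup over one-token-at-a-time decoding.
    best = ()
    for cand in candidates:
        accepted = []
        for tok in cand:
            if verify(tuple(accepted)) == tok:
                accepted.append(tok)
            else:
                break
        if len(accepted) > len(best):
            best = tuple(accepted)
    return best

# Toy stand-in for the base model: it always continues 1, 2, 3, ...
def toy_verify(prefix):
    return len(prefix) + 1

cands = build_candidates([[1, 9], [2, 8], [7, 3]])
print(accept_longest_prefix(cands, toy_verify))  # (1, 2, 3)
```

In the real system the heads and the verification share a single model forward pass via tree attention, so checking all candidates costs little more than one ordinary decoding step.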