
Medusa: Simple framework for accelerating LLM generation with multiple decoding heads

Blog post from Together AI

Post Details

Company: Together AI
Author: Tianle Cai*, Yuhong Li*, Zhengyang Geng, Hongwu Peng, Tri Dao (* Equal contribution)
Word Count: 2,817
Language: English
Summary

Medusa is a new framework for accelerating Large Language Model (LLM) generation that offers a simpler alternative to speculative decoding. By adding multiple decoding heads on top of the original model, each predicting a token several positions ahead, Medusa improves generation efficiency by about 2x. A tree-based attention mechanism lets the model verify multiple candidate continuations in parallel within a single forward pass, increasing computational throughput. Medusa's typical acceptance scheme replaces the importance sampling used in speculative decoding, offering an efficient way to generate creative output that stays robust across varying sampling temperatures. In tests on Vicuna models, Medusa achieved roughly a 2x wall-time speedup across various use cases, making it an attractive option for accelerating LLM generation and serving applications.
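To make the two core ideas concrete, here is a minimal sketch of the "multiple decoding heads" design, assuming PyTorch and the residual feed-forward head described in the Medusa paper. The names `MedusaHead` and `medusa_logits` are illustrative, not Medusa's actual API.

```python
import torch
import torch.nn as nn

class MedusaHead(nn.Module):
    """One extra decoding head: a residual feed-forward block followed by a
    projection to the vocabulary, predicting a token further ahead."""
    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)
        self.act = nn.SiLU()
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # The residual block keeps the head cheap to train on top of the
        # frozen base model's final hidden states.
        return self.lm_head(h + self.act(self.proj(h)))

def medusa_logits(hidden: torch.Tensor, base_lm_head: nn.Linear,
                  heads: nn.ModuleList) -> list[torch.Tensor]:
    """A single forward pass of the base model yields logits for the next
    K+1 positions at once: the original head plus K Medusa heads."""
    return [base_lm_head(hidden)] + [head(hidden) for head in heads]
```

At decoding time, the top few tokens from each head are combined into a tree of candidate continuations, which the tree-based attention mask verifies in one pass. The typical acceptance rule then decides which drafted tokens to keep: a token is accepted when the base model assigns it probability above an entropy-dependent threshold. The sketch below illustrates that rule; the constants `eps` and `delta` are placeholder values, not the paper's tuned hyperparameters.

```python
import torch

def typical_accept(probs: torch.Tensor, candidate: int,
                   eps: float = 0.3, delta: float = 0.09) -> bool:
    """Accept a candidate token if its probability under the base model
    exceeds min(eps, delta * exp(-H(p))), where H is the entropy of p.
    Higher-entropy distributions get a lower bar, which is what keeps
    the scheme adaptable across sampling temperatures."""
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum()
    threshold = min(eps, delta * torch.exp(-entropy).item())
    return probs[candidate].item() > threshold
```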