
How to double tokens per second for Llama 3 with Medusa

Blog post from Baseten

Post Details
Company: Baseten
Date Published:
Author: Abu Qader, Philip Kiely
Word Count: 1,462
Language: English
Hacker News Points: 2
Summary

Medusa is a technique for generating multiple tokens per forward pass during LLM inference, which can double the tokens per second of an LLM deployment. After training and validating Medusa heads (additional decoding heads grafted onto the base model), the modified LLM can be deployed to production using TensorRT-LLM. In a benchmark, Medusa doubled the tokens per second of Llama 3 8B running on an A100 in FP16, with no other major optimizations in place. However, it is crucial to validate output quality before deploying a model with Medusa to production.
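To make the grafted-heads idea concrete, below is a minimal PyTorch sketch of Medusa-style decoding heads, following the architecture described in the Medusa paper (a residual SiLU block plus a per-head LM projection). The class name MedusaHead, the toy dimensions, and the random placeholder hidden states are illustrative assumptions, not the exact code from the post or from TensorRT-LLM.

import torch
import torch.nn as nn

class MedusaHead(nn.Module):
    # One extra decoding head: a residual MLP block followed by its own LM projection.
    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)
        self.act = nn.SiLU()
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # The residual block preserves the base model's representation, then
        # projects to vocabulary logits for a token further ahead.
        hidden_states = hidden_states + self.act(self.proj(hidden_states))
        return self.lm_head(hidden_states)

# Toy dimensions so the sketch runs quickly; Llama 3 8B itself uses
# hidden_size=4096 and vocab_size=128256.
hidden_size, vocab_size, num_heads = 512, 32000, 4

# Head k is trained to predict the token k+1 positions ahead of the base
# model's usual next-token prediction; the base model stays frozen.
medusa_heads = nn.ModuleList(
    MedusaHead(hidden_size, vocab_size) for _ in range(num_heads)
)

# At inference, one forward pass of the base model yields hidden states;
# each head proposes a candidate token for a later position, and the
# candidates are verified against the base model so output quality holds.
last_hidden = torch.randn(1, 1, hidden_size)  # stand-in for real hidden states
candidate_tokens = [head(last_hidden).argmax(dim=-1) for head in medusa_heads]

In a real deployment, heads like these are trained on the base model's hidden states, validated for output quality, and then compiled together with the base model into a TensorRT-LLM engine, as the post describes.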