
Boosting MTP acceptance in TensorRT-LLM: +40% throughput

Blog post from Baseten

Post Details
Company: Baseten
Author: Mahmoud Hassan and 1 other
Word Count: 1,288
Language: English
Summary

Baseten has developed a hybrid speculative decoding method that combines n-gram speculation with draft-model speculation such as EAGLE or multi-token prediction (MTP), using a suffix automaton to improve draft-token accuracy and efficiency in repetitive workloads like code generation. Integrated into TensorRT-LLM, the method batches the token-verification phase, yielding up to 40% higher throughput and lower latency than MTP alone. The suffix automaton can match long repeated patterns and is updated incrementally as tokens are generated, which leads to higher acceptance rates on long sequences. The hybrid approach adapts dynamically to the workload, and its C++/CUDA implementation adds negligible overhead while maintaining high GPU utilization. Future work includes continual draft-model training and dynamic-length speculation, pointing to further efficiency gains in speculative decoding without changes to existing configuration parameters.
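To make the n-gram side concrete, here is a minimal Python sketch of suffix-automaton-based drafting. The class and method names are illustrative, not Baseten's actual implementation (which the post describes as C++/CUDA inside TensorRT-LLM): the automaton is extended online in amortized O(1) per token, the longest suffix of the current context that occurred earlier is matched, and the tokens that followed that earlier occurrence are proposed as the draft.

```python
class SuffixAutomaton:
    """Illustrative suffix automaton over the generated token stream.

    Supports incremental extension and longest-suffix matching, which is
    what makes it attractive for n-gram speculation: it can match
    arbitrarily long repeated patterns, not just fixed-size n-grams.
    """

    def __init__(self):
        self.seq = []            # tokens seen so far
        self.next = [{}]         # per-state transition maps
        self.link = [-1]         # suffix links
        self.length = [0]        # longest-string length per state
        self.first_end = [-1]    # end index of first occurrence per state
        self.last = 0            # state representing the whole sequence

    def extend(self, token):
        """Append one token (standard online construction)."""
        pos = len(self.seq)
        self.seq.append(token)
        cur = len(self.next)
        self.next.append({})
        self.link.append(-1)
        self.length.append(self.length[self.last] + 1)
        self.first_end.append(pos)
        p = self.last
        while p != -1 and token not in self.next[p]:
            self.next[p][token] = cur
            p = self.link[p]
        if p == -1:
            self.link[cur] = 0
        else:
            q = self.next[p][token]
            if self.length[p] + 1 == self.length[q]:
                self.link[cur] = q
            else:
                # split state q by cloning it at the shorter length
                clone = len(self.next)
                self.next.append(dict(self.next[q]))
                self.link.append(self.link[q])
                self.length.append(self.length[p] + 1)
                self.first_end.append(self.first_end[q])
                while p != -1 and self.next[p].get(token) == q:
                    self.next[p][token] = clone
                    p = self.link[p]
                self.link[q] = clone
                self.link[cur] = clone
        self.last = cur

    def propose(self, context, n_draft=4):
        """Draft the tokens that followed the longest matching suffix of
        `context` seen earlier; empty list if nothing matches."""
        state, matched = 0, 0
        for tok in context:
            # fall back along suffix links until a transition exists
            while state != 0 and tok not in self.next[state]:
                state = self.link[state]
                matched = self.length[state]
            if tok in self.next[state]:
                state = self.next[state][tok]
                matched += 1
            else:
                state, matched = 0, 0
        if matched == 0:
            return []
        end = self.first_end[state]
        return self.seq[end + 1 : end + 1 + n_draft]
```

For example, after feeding the stream `[10, 20, 30, 40, 50, 20, 30, 60]`, a context ending in `20, 30` matches the earlier occurrence at positions 1–2, so the automaton drafts the tokens that followed it: `[40, 50, 20, 30]`.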
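The hybrid selection and batched verification steps can be sketched as follows. Both function names and the selection policy are hypothetical, assuming greedy verification (the target model scores all draft positions in one batched forward pass and the longest agreeing prefix is accepted); the post does not specify Baseten's exact policy.

```python
def choose_draft(ngram_draft, mtp_draft, min_ngram_len=2):
    """Hypothetical hybrid policy: trust the suffix-automaton draft only
    when its match is long enough, otherwise fall back to the
    model-based (MTP/EAGLE) draft."""
    if len(ngram_draft) >= max(min_ngram_len, len(mtp_draft)):
        return ngram_draft
    return mtp_draft


def accept_prefix(draft, target_preds):
    """Greedy verification: accept the longest prefix where the draft
    agrees with the target model's greedy predictions, then take the
    target's own token at the first disagreement for free."""
    n = 0
    for d, t in zip(draft, target_preds):
        if d != t:
            break
        n += 1
    accepted = list(draft[:n])
    if n < len(target_preds):
        accepted.append(target_preds[n])  # target's correction/bonus token
    return accepted
```

With this shape, `accept_prefix([1, 2, 3], [1, 2, 9, 4])` accepts `[1, 2]` and appends the target's correction `9`, while a fully accepted draft also gains the target's bonus token, which is where the throughput win over single-token decoding comes from.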