Boosting MTP acceptance in TensorRT-LLM: +40% throughput
Blog post from Baseten
Baseten has developed a hybrid speculative decoding method that combines n-gram speculation, driven by a suffix automaton, with draft-model speculation such as EAGLE or multi-token prediction (MTP), improving token prediction accuracy and efficiency in workloads like code generation. Integrated into TensorRT-LLM, the method batches the token verification phase, delivering up to 40% higher throughput and lower latency than MTP alone.

Because the suffix automaton is updated online as tokens are generated, it can match long repeated patterns in the context, which yields higher acceptance rates on long sequences. The hybrid approach adapts dynamically to workload requirements, and its C++/CUDA implementation adds near-zero overhead while keeping GPU utilization high. Future work includes continuous draft-model training and dynamic-length speculation, pointing to further efficiency gains in speculative decoding without requiring configuration changes.
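The idea of suffix-automaton speculation can be illustrated with a minimal sketch. This is not Baseten's implementation (which lives in C++/CUDA inside TensorRT-LLM); it is a small Python model of the technique: `extend` appends each generated token to an online suffix automaton, and `speculate` proposes the tokens that followed the longest suffix of the current context that also occurred earlier in the sequence.

```python
class SuffixAutomaton:
    """Online suffix automaton over a token stream (illustrative sketch).

    Each call to extend() adds one token in amortized O(1). speculate()
    finds the longest suffix of the context that occurred earlier and
    returns the tokens that followed that earlier occurrence as a draft.
    """

    def __init__(self):
        self.next = [{}]    # per-state transition maps: token -> state
        self.link = [-1]    # suffix links
        self.length = [0]   # length of the longest substring in each state
        self.end_pos = [0]  # an end position of each state's substrings
        self.last = 0       # state for the whole sequence so far
        self.tokens = []    # the generated sequence

    def extend(self, token):
        """Standard online suffix-automaton construction step."""
        self.tokens.append(token)
        cur = len(self.next)
        self.next.append({})
        self.link.append(-1)
        self.length.append(self.length[self.last] + 1)
        self.end_pos.append(len(self.tokens) - 1)
        p = self.last
        while p != -1 and token not in self.next[p]:
            self.next[p][token] = cur
            p = self.link[p]
        if p == -1:
            self.link[cur] = 0
        else:
            q = self.next[p][token]
            if self.length[p] + 1 == self.length[q]:
                self.link[cur] = q
            else:
                # Split state q: clone keeps q's earlier end position.
                clone = len(self.next)
                self.next.append(dict(self.next[q]))
                self.link.append(self.link[q])
                self.length.append(self.length[p] + 1)
                self.end_pos.append(self.end_pos[q])
                while p != -1 and self.next[p].get(token) == q:
                    self.next[p][token] = clone
                    p = self.link[p]
                self.link[q] = clone
                self.link[cur] = clone
        self.last = cur

    def speculate(self, max_draft=4):
        """Draft tokens from the longest earlier-seen suffix of the context."""
        s = self.link[self.last]
        if s <= 0:
            return []  # no suffix of the context has occurred before
        pos = self.end_pos[s]  # end of an earlier occurrence of that suffix
        return self.tokens[pos + 1 : pos + 1 + max_draft]
```

For example, after feeding the tokens `a b c a b`, the longest repeated suffix is `a b` (first seen at positions 0-1), so the automaton drafts `c a b`, the tokens that followed it before. This is what makes the method strong on code, where identifiers and boilerplate recur verbatim.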
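Whichever source produces the draft, automaton or draft model, the batched verification phase applies the same accept rule: the target model scores all draft positions in one forward pass, and the longest prefix agreeing with the target's own outputs is accepted, plus one corrected token. A hedged sketch of that rule for greedy sampling (the function name and list-based interface are illustrative, not TensorRT-LLM's API):

```python
def accept_draft(draft, target_greedy):
    """Greedy speculative-decoding acceptance (illustrative sketch).

    draft: tokens proposed by the suffix automaton or the draft model.
    target_greedy: the target model's argmax token at each draft position
    plus one extra position (len(target_greedy) == len(draft) + 1), all
    computed in a single batched forward pass over the draft.
    """
    n = 0
    while n < len(draft) and draft[n] == target_greedy[n]:
        n += 1
    # Accept the agreeing prefix plus the target's token at the first
    # divergence, so every emitted token is an exact target output.
    return draft[:n] + [target_greedy[n]]
```

Correctness is unaffected by draft quality; only speed is. A longer accepted prefix means more tokens per target-model forward pass, which is why higher acceptance rates translate directly into throughput gains.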