Open-Sourcing Baseten’s Suffix Automaton MTP Accelerator

Post Details

Company

Baseten

Date Published

Jan. 28, 2026

Author

Mahmoud Hassan and 1 other

Word Count

1,290

Language

English

Hacker News Points

-

Source URL

www.baseten.co/blog/boosting-mtp-acceptance-rates-in-baseten-speculation-engine

Summary

Baseten has developed a hybrid speculative decoding method that combines a suffix automaton with multi-token prediction (MTP) to make token prediction more efficient in applications such as code generation. Compared with MTP alone, the approach achieves up to 40% higher throughput and lower latency. The suffix automaton improves on traditional n-gram lookups in two ways: it can predict arbitrarily long patterns, and it updates in real time as tokens are generated, which makes it particularly effective on long, repetitive sequences. Baseten's speculation engine, part of the company's inference stack, switches between suffix automaton matches and MTP based on the length of the match. The method has been integrated into the open-source NVIDIA TensorRT-LLM and is designed to add minimal overhead in production workloads. The implementation uses CUDA for data transfer and processing so that the GPU stays highly utilized with minimal idle time. The approach is compatible with existing MTP/EAGLE setups and leaves room for future enhancements such as continuous model training and dynamic-length speculation.
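To make the switching idea concrete, the following is a minimal CPU-only sketch in Python, not Baseten's or TensorRT-LLM's implementation. It builds a standard online suffix automaton over the tokens seen so far, finds the longest suffix of the context that already occurred earlier, and proposes the tokens that followed that earlier occurrence as the draft. The names `SuffixAutomaton`, `choose_draft`, `min_match_len`, and `max_draft` are hypothetical, the threshold rule is an assumption about how a length-based switch could work, and the MTP draft is passed in as a plain list standing in for a real MTP head.

```python
class SuffixAutomaton:
    """Online suffix automaton over a growing token sequence (illustrative only)."""

    def __init__(self):
        self.next = [{}]        # per-state transitions: token -> state
        self.link = [-1]        # suffix links; the root (state 0) has none
        self.length = [0]       # length of the longest string in each state
        self.first_end = [-1]   # end index of the first occurrence of each state
        self.last = 0           # state representing the whole sequence
        self.tokens = []

    def _new_state(self, length, first_end):
        self.next.append({})
        self.link.append(-1)
        self.length.append(length)
        self.first_end.append(first_end)
        return len(self.next) - 1

    def extend(self, tok):
        """Append one token; amortized O(1) state updates per token."""
        self.tokens.append(tok)
        cur = self._new_state(self.length[self.last] + 1, len(self.tokens) - 1)
        p = self.last
        while p != -1 and tok not in self.next[p]:
            self.next[p][tok] = cur
            p = self.link[p]
        if p == -1:
            self.link[cur] = 0
        else:
            q = self.next[p][tok]
            if self.length[p] + 1 == self.length[q]:
                self.link[cur] = q
            else:
                # Split q: the clone keeps q's transitions and first occurrence.
                clone = self._new_state(self.length[p] + 1, self.first_end[q])
                self.next[clone] = dict(self.next[q])
                self.link[clone] = self.link[q]
                while p != -1 and self.next[p].get(tok) == q:
                    self.next[p][tok] = clone
                    p = self.link[p]
                self.link[q] = clone
                self.link[cur] = clone
        self.last = cur

    def longest_repeated_suffix(self):
        """Return (match_len, end_index) of the longest suffix seen earlier."""
        s = self.link[self.last]
        if s <= 0:              # root or empty sequence: no earlier occurrence
            return 0, -1
        return self.length[s], self.first_end[s]

    def draft(self, max_draft):
        """Copy up to max_draft tokens that followed the earlier occurrence."""
        match_len, end = self.longest_repeated_suffix()
        if match_len == 0:
            return []
        return self.tokens[end + 1:end + 1 + max_draft]


def choose_draft(sam, mtp_draft, min_match_len, max_draft):
    """Hypothetical switch: prefer the suffix draft when the match is long enough."""
    match_len, _ = sam.longest_repeated_suffix()
    if match_len >= min_match_len:
        return sam.draft(max_draft), "suffix"
    return list(mtp_draft), "mtp"


# Usage: the context ends with [1, 2, 3], which also appeared at the start.
sam = SuffixAutomaton()
for t in [1, 2, 3, 4, 1, 2, 3]:
    sam.extend(t)
print(sam.longest_repeated_suffix())   # (3, 2): length-3 match ending at index 2
print(choose_draft(sam, mtp_draft=[9, 9], min_match_len=3, max_draft=2))
```

Either way, the target model still verifies every drafted token, so the switch only changes which drafter proposes tokens, not what the model ultimately outputs. Because the automaton is extended one token at a time, the match can grow to any length as generation proceeds, which a fixed-size n-gram table cannot do.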