Open-Sourcing Baseten’s Suffix Automaton MTP Accelerator

Blog post from Baseten

Post Details
Company: Baseten
Author: Mahmoud Hassan and 1 other
Word Count: 1,290
Language: English
Summary

Baseten has developed a hybrid speculative decoding method that combines a suffix automaton with multi-token prediction (MTP) to speed up generation in applications such as code generation. By making token verification and prediction more efficient, it achieves up to 40% higher throughput and lower latency than MTP alone. Unlike traditional n-gram lookups, the suffix automaton can predict arbitrarily long patterns and updates in real time, which makes it particularly effective on long, repetitive sequences. Baseten's speculation engine, part of its inference stack, switches between suffix automaton matches and MTP based on the length of the match. The method has been integrated into the open-source NVIDIA TensorRT-LLM with minimal overhead and high performance in production workloads. The implementation uses CUDA for efficient data transfer and processing, keeping GPU utilization high with minimal idle time. The approach is also compatible with existing MTP/EAGLE setups and leaves room for future enhancements such as continuous model training and dynamic-length speculation.
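To make the switching idea concrete, here is a minimal CPU-side sketch in Python. It is not Baseten's implementation and does not use any TensorRT-LLM API. It builds a standard online suffix automaton over the tokens seen so far, finds the longest suffix of the context that has occurred earlier, and proposes the tokens that followed that earlier occurrence as a draft. The names `SuffixAutomaton`, `propose`, `choose_draft`, the `min_match_len` threshold, and the `mtp_draft` argument (a stand-in for whatever the MTP head would propose) are all assumptions made for illustration.

```python
class SuffixAutomaton:
    """Online suffix automaton over a token sequence (illustrative sketch)."""

    def __init__(self):
        self.next = [{}]        # per-state transitions: token -> state
        self.link = [-1]        # suffix links; the root (state 0) has none
        self.length = [0]       # length of the longest string in each state
        self.first_end = [-1]   # end index of the first occurrence of each state
        self.last = 0           # state representing the whole sequence so far
        self.tokens = []

    def _new_state(self, length, first_end):
        self.next.append({})
        self.link.append(-1)
        self.length.append(length)
        self.first_end.append(first_end)
        return len(self.next) - 1

    def extend(self, tok):
        """Append one token and update the automaton in amortized O(1)."""
        self.tokens.append(tok)
        cur = self._new_state(self.length[self.last] + 1, len(self.tokens) - 1)
        p = self.last
        while p != -1 and tok not in self.next[p]:
            self.next[p][tok] = cur
            p = self.link[p]
        if p == -1:
            self.link[cur] = 0
        else:
            q = self.next[p][tok]
            if self.length[p] + 1 == self.length[q]:
                self.link[cur] = q
            else:
                # Split q: the clone keeps the shorter strings of q's class.
                clone = self._new_state(self.length[p] + 1, self.first_end[q])
                self.next[clone] = dict(self.next[q])
                self.link[clone] = self.link[q]
                while p != -1 and self.next[p].get(tok) == q:
                    self.next[p][tok] = clone
                    p = self.link[p]
                self.link[q] = clone
                self.link[cur] = clone
        self.last = cur

    def propose(self, max_draft):
        """Return (match_len, draft) from the longest suffix seen earlier."""
        state = self.link[self.last]
        if state <= 0:              # empty context, or no repeated suffix
            return 0, []
        start = self.first_end[state] + 1
        return self.length[state], self.tokens[start:start + max_draft]


def choose_draft(sam, mtp_draft, min_match_len=3, max_draft=8):
    """Use the automaton's draft on a long enough match, else fall back to MTP.

    min_match_len is a hypothetical threshold, not a value from the post.
    """
    match_len, draft = sam.propose(max_draft)
    if match_len >= min_match_len and draft:
        return "suffix_automaton", draft
    return "mtp", list(mtp_draft)


sam = SuffixAutomaton()
for t in [1, 2, 3, 4, 5, 1, 2, 3]:   # the context ends by repeating "1 2 3"
    sam.extend(t)
source, draft = choose_draft(sam, mtp_draft=[9], max_draft=2)
```

In this toy run the context ends with a three-token repeat, so the sketch drafts the two tokens that followed the first occurrence instead of using the MTP stand-in. In a real engine the drafted tokens would still be verified by the target model, and the threshold would be tuned against measured acceptance rates.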