Boosting MTP acceptance in TensorRT-LLM: +40% throughput
Blog post from Baseten
Baseten has developed a hybrid speculative decoding method that combines n-gram speculation, driven by a suffix automaton, with draft-model speculation such as EAGLE or multi-token prediction (MTP), improving token prediction accuracy and efficiency in workloads like code generation. Integrated into TensorRT-LLM, the method batches the token verification phase, delivering up to 40% higher throughput and lower latency than MTP alone.

Because the suffix automaton is updated online as tokens are generated, it can match long repeated patterns in the context, which yields higher acceptance rates on long sequences. The hybrid approach adapts dynamically to workload requirements, and its C++/CUDA implementation adds near-zero overhead while keeping GPU utilization high. Future work includes continuous draft-model training and dynamic-length speculation, pointing to further efficiency gains in speculative decoding without requiring configuration changes.
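The idea of suffix-automaton speculation can be illustrated with a minimal sketch. This is not Baseten's implementation (which lives in C++/CUDA inside TensorRT-LLM); it is a small Python model of the technique: `extend` appends each generated token to an online suffix automaton, and `speculate` proposes the tokens that followed the longest suffix of the current context that also occurred earlier in the sequence.

```python
class SuffixAutomaton:
    """Online suffix automaton over a token stream (illustrative sketch).

    Each call to extend() adds one token in amortized O(1). speculate()
    finds the longest suffix of the context that occurred earlier and
    returns the tokens that followed that earlier occurrence as a draft.
    """

    def __init__(self):
        self.next = [{}]    # per-state transition maps: token -> state
        self.link = [-1]    # suffix links
        self.length = [0]   # length of the longest substring in each state
        self.end_pos = [0]  # an end position of each state's substrings
        self.last = 0       # state for the whole sequence so far
        self.tokens = []    # the generated sequence

    def extend(self, token):
        """Standard online suffix-automaton construction step."""
        self.tokens.append(token)
        cur = len(self.next)
        self.next.append({})
        self.link.append(-1)
        self.length.append(self.length[self.last] + 1)
        self.end_pos.append(len(self.tokens) - 1)
        p = self.last
        while p != -1 and token not in self.next[p]:
            self.next[p][token] = cur
            p = self.link[p]
        if p == -1:
            self.link[cur] = 0
        else:
            q = self.next[p][token]
            if self.length[p] + 1 == self.length[q]:
                self.link[cur] = q
            else:
                # Split state q: clone keeps q's earlier end position.
                clone = len(self.next)
                self.next.append(dict(self.next[q]))
                self.link.append(self.link[q])
                self.length.append(self.length[p] + 1)
                self.end_pos.append(self.end_pos[q])
                while p != -1 and self.next[p].get(token) == q:
                    self.next[p][token] = clone
                    p = self.link[p]
                self.link[q] = clone
                self.link[cur] = clone
        self.last = cur

    def speculate(self, max_draft=4):
        """Draft tokens from the longest earlier-seen suffix of the context."""
        s = self.link[self.last]
        if s <= 0:
            return []  # no suffix of the context has occurred before
        pos = self.end_pos[s]  # end of an earlier occurrence of that suffix
        return self.tokens[pos + 1 : pos + 1 + max_draft]
```

For example, after feeding the tokens `a b c a b`, the longest repeated suffix is `a b` (first seen at positions 0-1), so the automaton drafts `c a b`, the tokens that followed it before. This is what makes the method strong on code, where identifiers and boilerplate recur verbatim.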
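Whichever source produces the draft, automaton or draft model, the batched verification phase applies the same accept rule: the target model scores all draft positions in one forward pass, and the longest prefix agreeing with the target's own outputs is accepted, plus one corrected token. A hedged sketch of that rule for greedy sampling (the function name and list-based interface are illustrative, not TensorRT-LLM's API):

```python
def accept_draft(draft, target_greedy):
    """Greedy speculative-decoding acceptance (illustrative sketch).

    draft: tokens proposed by the suffix automaton or the draft model.
    target_greedy: the target model's argmax token at each draft position
    plus one extra position (len(target_greedy) == len(draft) + 1), all
    computed in a single batched forward pass over the draft.
    """
    n = 0
    while n < len(draft) and draft[n] == target_greedy[n]:
        n += 1
    # Accept the agreeing prefix plus the target's token at the first
    # divergence, so every emitted token is an exact target output.
    return draft[:n] + [target_greedy[n]]
```

Correctness is unaffected by draft quality; only speed is. A longer accepted prefix means more tokens per target-model forward pass, which is why higher acceptance rates translate directly into throughput gains.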