Company
Date Published
Author
Tri Dao and 2 others
Word count
1188
Language
English
Hacker News points
None

Summary

Baseten's work on speculative decoding, particularly with EAGLE-3, has pushed its GPT-OSS 120B inference API past 650 tokens per second, as verified by Artificial Analysis. A launch partner for GPT-OSS, Baseten built on an NVIDIA stack of B200 GPUs, TensorRT-LLM, and NVIDIA Dynamo to become the fastest NVIDIA-based provider, rivaling providers that run on custom hardware. Tensor parallelism and KV-aware routing have further improved performance, and additional gains from prefill-decode (PD) disaggregation and more advanced speculation are still under investigation. Benchmarks from Artificial Analysis and OpenRouter show Baseten matching or exceeding custom hardware solutions while keeping the flexibility and scalability of NVIDIA GPUs. Baseten continues to invest in model performance engineering and invites readers to explore its work and open roles in the field.