How we made the fastest GPT-OSS on NVIDIA GPUs 60% faster

Post Details

Company

Baseten

Date Published

Oct. 24, 2025

Author

Tri Dao and 2 others

Word Count

1,188

Language

English

Hacker News Points

-

Source URL

www.baseten.co/blog/how-we-made-the-fastest-gpt-oss-on-nvidia-gpus-60-percent-faster

Summary

Baseten's work on speculative decoding, particularly with EAGLE-3, has pushed its GPT-OSS 120B inference API past 650 tokens per second, as verified by Artificial Analysis. As a launch partner for GPT-OSS, Baseten built on the NVIDIA stack (B200 GPUs on the hardware side, TensorRT-LLM and NVIDIA Dynamo on the software side) to become the fastest NVIDIA-based provider, rivaling providers that run custom hardware. Techniques such as tensor parallelism and KV-aware routing have further improved performance, and additional gains from prefill/decode (PD) disaggregation and more advanced speculation remain under investigation. Benchmarks from Artificial Analysis and OpenRouter show Baseten matching or exceeding custom hardware solutions while retaining the flexibility and scalability of NVIDIA GPUs. Baseten continues to work on model performance engineering and points readers to further resources and open roles in the field.