Company: Baseten
Date Published:
Author: Amir Haghighat and 4 others
Word count: 938
Language: English
Hacker News points: None

Summary

Achieving state-of-the-art (SOTA) latency and throughput for the GPT OSS 120B model on NVIDIA GPUs is a complex process of performance optimization involving experimentation, bug fixing, and benchmarking. The Baseten Inference Stack supports this work through its flexible architecture and the expertise of its model performance engineering team, enabling rapid performance improvements. On the model's release, engineers worked in parallel across inference frameworks such as TensorRT-LLM, vLLM, and SGLang, ensuring compatibility with both Hopper and Blackwell GPU architectures. The team fixed compatibility bugs and tuned model configurations, choosing Tensor Parallelism for lower latency and leveraging the TensorRT-LLM MoE backend for higher performance. These efforts produced significant gains, including an additional 100 tokens per second while maintaining 100% uptime, and underscore how inference optimization delivers immediate improvements in latency and throughput. The team continues to explore techniques such as speculative decoding to further improve model performance, with a focus on providing efficient solutions for developers looking to optimize their models.
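
As a rough illustration of the kind of configuration and measurement the summary describes, the sketch below loads the model with tensor parallelism via vLLM's offline API and estimates generation throughput in tokens per second. This is a minimal sketch, not Baseten's actual setup: the model ID (`openai/gpt-oss-120b`), the GPU count of 8, the prompt, and the sampling parameters are assumptions for illustration.

```python
# Hypothetical sketch: serve gpt-oss-120b with tensor parallelism in vLLM
# and measure generation throughput. Model ID, GPU count, and prompt are
# illustrative assumptions, not the configuration from the article.
import time

from vllm import LLM, SamplingParams

# Tensor parallelism splits each layer's weights across GPUs, which tends to
# favor per-request latency (the trade-off mentioned in the summary).
# tensor_parallel_size=8 assumes a single 8-GPU Hopper or Blackwell node.
llm = LLM(model="openai/gpt-oss-120b", tensor_parallel_size=8)

sampling = SamplingParams(temperature=0.7, max_tokens=512)
prompts = ["Explain the difference between latency and throughput in LLM inference."]

start = time.perf_counter()
outputs = llm.generate(prompts, sampling)
elapsed = time.perf_counter() - start

# Rough tokens-per-second estimate over the generated completions only.
generated_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated_tokens} tokens in {elapsed:.2f}s "
      f"-> {generated_tokens / elapsed:.1f} tokens/sec")
```

A comparable run with a pipeline- or expert-parallel configuration, or with a different backend such as SGLang or TensorRT-LLM, would be benchmarked the same way to compare latency and throughput under matched settings.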