Qwen 3, a new family of open-source LLMs from Alibaba, includes Qwen 3 235B, a state-of-the-art reasoning model that rivals DeepSeek-R1 while requiring significantly less hardware to run in production. The model uses a Mixture of Experts (MoE) architecture with 128 experts, 8 of which are active per token, and is well suited to deployment with SGLang, an open-source fast inference framework. Thanks to day-zero support, Qwen 3 delivers very usable performance out of the box; smaller batches reduce latency but also lower throughput, which raises the effective cost per token. On public benchmarks the model compares favorably to models like DeepSeek-R1 and Gemini 2.5 Pro.

To take full advantage of its efficiency in production, Qwen 3 can be served with low latency and high throughput using SGLang, which shards the model across GPUs via tensor parallelism; the --tp argument specifies how many GPUs to use for inference. Output quality can be further improved by varying the temperature between thinking and non-thinking modes, taking advantage of the model's agentic capabilities, and using the entire context window. Qwen 3 is available in both FP8 and BF16 precision, and FP8 is the recommended choice, as it offers nearly identical quality at a much lower cost.
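
As a minimal sketch of what such a deployment could look like, the example below assumes an 8-GPU node, the Qwen/Qwen3-235B-A22B-FP8 checkpoint, and SGLang's default OpenAI-compatible endpoint on port 30000; the model name, port, and sampling parameters are illustrative and should be adjusted to your setup.

```python
# Launch the server first, sharding the model across 8 GPUs with tensor parallelism:
#   python -m sglang.launch_server --model-path Qwen/Qwen3-235B-A22B-FP8 --tp 8
#
# The snippet below then queries the OpenAI-compatible endpoint that SGLang exposes,
# using a lower temperature, as is commonly recommended for Qwen 3's thinking mode.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-235B-A22B-FP8",
    messages=[
        {"role": "user", "content": "Explain tensor parallelism in two sentences."}
    ],
    temperature=0.6,  # thinking mode: lower temperature; raise it (e.g. 0.7) for non-thinking mode
    top_p=0.95,
    max_tokens=1024,
)
print(response.choices[0].message.content)
```

Swapping in the BF16 checkpoint only requires changing the model path at launch time; the client-side code stays the same.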