Author: Abu Qader and 2 others
Word count: 904
Language: English

Summary

Baseten is collaborating with NVIDIA to improve model performance by adopting NVIDIA Dynamo, an open-source inference framework designed for serving large language models (LLMs) at scale across distributed GPU clusters. A key feature of NVIDIA Dynamo is KV cache-aware routing, which speeds up inference by directing each request to a model replica that already holds the relevant context in its KV cache, avoiding redundant prefill computation. By balancing cache hit rates against even workload distribution, this routing delivers substantial improvements in time to first token (TTFT) and time per output token (TPOT), as demonstrated in benchmarks with models such as Qwen3 Coder. Baseten reports notable latency reductions and throughput gains, processing more requests per second and more output tokens per second. Looking ahead, Baseten plans to further leverage NVIDIA's tooling, exploring features such as KV cache offloading to improve resource utilization and concurrency, and will co-host a technical workshop with NVIDIA to share insights on maximizing AI inference workloads with NVIDIA Dynamo.
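The trade-off at the heart of KV cache-aware routing can be illustrated with a toy scheduler. The sketch below is a hypothetical simplification, not NVIDIA Dynamo's actual implementation or API: each replica is scored by how much of the incoming prompt's token prefix it already has cached, minus a penalty for its current load, and the request goes to the highest-scoring replica. The `Replica` class, the scoring weights, and the function names are all illustrative assumptions.

```python
# Toy sketch of KV cache-aware routing (illustrative only, not Dynamo's API).
# Each replica advertises the token prefixes it has in its KV cache; the
# router balances expected cache reuse against current replica load.
from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    cached_prefixes: list = field(default_factory=list)  # token-id prefixes in KV cache
    active_requests: int = 0                             # crude load signal


def prefix_overlap(prompt: list, prefix: list) -> int:
    """Length of the shared leading run of token ids between prompt and a cached prefix."""
    n = 0
    for a, b in zip(prompt, prefix):
        if a != b:
            break
        n += 1
    return n


def route(prompt: list, replicas: list, cache_weight: float = 1.0,
          load_weight: float = 0.5) -> Replica:
    """Pick the replica maximizing (fraction of prompt already cached) - (load penalty)."""
    def score(r: Replica) -> float:
        best_hit = max((prefix_overlap(prompt, p) for p in r.cached_prefixes), default=0)
        return cache_weight * (best_hit / max(len(prompt), 1)) - load_weight * r.active_requests
    return max(replicas, key=score)


replicas = [
    Replica("a", cached_prefixes=[[1, 2, 3, 4]]),
    Replica("b", cached_prefixes=[[1, 2, 3, 4, 5, 6]]),
]

# Replica "b" caches a longer matching prefix for this prompt, so it wins;
# but if "b" were heavily loaded, the load penalty would steer traffic to "a".
chosen = route([1, 2, 3, 4, 5, 6, 7], replicas)
```

In this toy model, pure cache affinity would pile every request with a popular prefix onto one replica; the load term is what keeps the cluster balanced, which is the same tension the summary above describes.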