Author: Abu Qader and 2 others
Word count: 904
Language: English

Summary

Baseten is collaborating with NVIDIA to improve model performance by adopting NVIDIA Dynamo, an open-source inference framework designed for serving large language models (LLMs) at scale across distributed GPU clusters. A key feature of NVIDIA Dynamo is KV cache-aware routing, which speeds up inference by directing each request to a model replica that already holds the relevant context in its KV cache, avoiding redundant prefill computation. By balancing cache hit rates against even workload distribution, this routing delivers substantial improvements in time to first token (TTFT) and time per output token (TPOT), as demonstrated in benchmarks with models such as Qwen3 Coder. Baseten reports notable latency reductions and throughput gains, processing more requests per second and more output tokens per second. Looking ahead, Baseten plans to further leverage NVIDIA's tooling, exploring features such as KV cache offloading to improve resource utilization and concurrency, and will co-host a technical workshop with NVIDIA to share insights on maximizing AI inference workloads with NVIDIA Dynamo.
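The trade-off at the heart of KV cache-aware routing can be illustrated with a toy scheduler. The sketch below is a hypothetical simplification, not NVIDIA Dynamo's actual implementation or API: each replica is scored by how much of the incoming prompt's token prefix it already has cached, minus a penalty for its current load, and the request goes to the highest-scoring replica. The `Replica` class, the scoring weights, and the function names are all illustrative assumptions.

```python
# Toy sketch of KV cache-aware routing (illustrative only, not Dynamo's API).
# Each replica advertises the token prefixes it has in its KV cache; the
# router balances expected cache reuse against current replica load.
from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    cached_prefixes: list = field(default_factory=list)  # token-id prefixes in KV cache
    active_requests: int = 0                             # crude load signal


def prefix_overlap(prompt: list, prefix: list) -> int:
    """Length of the shared leading run of token ids between prompt and a cached prefix."""
    n = 0
    for a, b in zip(prompt, prefix):
        if a != b:
            break
        n += 1
    return n


def route(prompt: list, replicas: list, cache_weight: float = 1.0,
          load_weight: float = 0.5) -> Replica:
    """Pick the replica maximizing (fraction of prompt already cached) - (load penalty)."""
    def score(r: Replica) -> float:
        best_hit = max((prefix_overlap(prompt, p) for p in r.cached_prefixes), default=0)
        return cache_weight * (best_hit / max(len(prompt), 1)) - load_weight * r.active_requests
    return max(replicas, key=score)


replicas = [
    Replica("a", cached_prefixes=[[1, 2, 3, 4]]),
    Replica("b", cached_prefixes=[[1, 2, 3, 4, 5, 6]]),
]

# Replica "b" caches a longer matching prefix for this prompt, so it wins;
# but if "b" were heavily loaded, the load penalty would steer traffic to "a".
chosen = route([1, 2, 3, 4, 5, 6, 7], replicas)
```

In this toy model, pure cache affinity would pile every request with a popular prefix onto one replica; the load term is what keeps the cluster balanced, which is the same tension the summary above describes.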