
The Baseten Inference Stack at NVIDIA Dynamo Day

Blog post from Baseten

Post Details
Company: Baseten
Date Published: -
Author: Rachel Rapp
Word Count: 1,098
Language: English
Hacker News Points: -
Summary

The Baseten Inference Stack combines open-source tools like NVIDIA Dynamo with in-house innovations to serve generative AI workloads at low latency and high throughput. Because Dynamo is framework-agnostic and regularly updated, Baseten can plug in whichever inference engine best fits a given model and use case.

Dynamo enables system-level inference optimizations such as disaggregated serving, KV cache-aware routing, and KV cache offloading, which together cut latency and raise throughput significantly. Baseten's engineers contribute enhancements and new features back to the Dynamo ecosystem, work that was highlighted at NVIDIA's Dynamo Day event. These techniques help Baseten deliver 99.99% reliability for model serving, and extensions to Dynamo also support multimodal model serving for more complex AI workloads.
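To make the KV cache-aware routing idea concrete, below is a minimal sketch of one common approach: hash the prompt in fixed-size token blocks (each hash chained to the previous one, so a hash identifies a full prefix) and send the request to the replica holding the longest matching cached prefix. This is an illustrative assumption, not Baseten's or Dynamo's actual implementation; all names (`CacheAwareRouter`, `block_hashes`, `BLOCK_SIZE`) are hypothetical.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per KV cache block (hypothetical value)


def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Hash each full block of the prompt, chaining each hash to the
    previous one so a block's hash identifies the entire prefix."""
    hashes = []
    prev = ""
    full = len(tokens) - len(tokens) % block_size  # ignore partial tail block
    for i in range(0, full, block_size):
        block = tokens[i:i + block_size]
        prev = hashlib.sha256(
            (prev + ",".join(map(str, block))).encode()
        ).hexdigest()
        hashes.append(prev)
    return hashes


class CacheAwareRouter:
    """Route each request to the replica with the longest cached prefix."""

    def __init__(self, replicas):
        # Track which block hashes each replica has cached.
        self.cached = {r: set() for r in replicas}

    def route(self, tokens):
        hashes = block_hashes(tokens)

        def overlap(replica):
            # Count consecutive leading blocks already cached there.
            n = 0
            for h in hashes:
                if h not in self.cached[replica]:
                    break
                n += 1
            return n

        best = max(self.cached, key=overlap)
        self.cached[best].update(hashes)  # replica now holds this prefix
        return best


# Two requests sharing a 64-token prefix land on the same replica,
# so the second one reuses the cached KV blocks instead of recomputing them.
router = CacheAwareRouter(["replica-a", "replica-b"])
first = router.route(list(range(64)))
second = router.route(list(range(64)) + [99] * 16)
```

A production router would also weigh replica load (pure prefix affinity concentrates traffic on one replica) and expire cache entries as blocks are evicted; this sketch shows only the prefix-matching core.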