How DigitalOcean’s Agentic Inference Cloud powered by NVIDIA GPUs Achieved 67% Lower Inference Costs for Workato
Blog post from DigitalOcean
DigitalOcean's collaboration with Workato's AI Research Lab cut inference costs by 67% and improved performance for Workato's agentic AI automation workloads. The team deployed NVIDIA Dynamo with vLLM on DigitalOcean Kubernetes Service (DOKS), running on NVIDIA H200 GPUs for their larger memory capacity and higher throughput.

The key innovation was KV-aware routing: by steering each request to a worker whose KV cache is already warm for that workload, the system avoids redundant computation, sharply reducing latency and raising throughput. This yielded a 67% increase in tokens per second per GPU and let Workato run on 40% fewer GPUs, driving the cost savings. The project's takeaway is that efficient inference at scale comes from optimizing the system architecture around AI models, not merely from adding hardware.
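To make the routing idea concrete, here is a minimal sketch of how a KV-aware router might work. This is an illustration, not Dynamo's actual implementation: the block size, hashing scheme, and scoring rule (longest warm-prefix match, ties broken by load) are all assumptions for the example.

```python
import hashlib
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative value)

def block_hashes(tokens):
    """Hash each fixed-size prefix block cumulatively, so prompts that
    share a prefix produce identical leading hash chains."""
    hashes = []
    h = hashlib.sha256()
    usable = len(tokens) - len(tokens) % BLOCK_SIZE
    for i in range(0, usable, BLOCK_SIZE):
        h.update(repr(tokens[i:i + BLOCK_SIZE]).encode())
        hashes.append(h.copy().hexdigest())
    return hashes

@dataclass
class Worker:
    name: str
    cached: set = field(default_factory=set)  # block hashes warm in KV cache
    load: int = 0                             # in-flight requests

def route(workers, tokens):
    """Pick the worker with the longest warm-prefix match for this prompt,
    breaking ties by lowest current load (hypothetical scoring rule)."""
    hashes = block_hashes(tokens)

    def score(w):
        matched = 0
        for h in hashes:        # prefix blocks must match in order
            if h in w.cached:
                matched += 1
            else:
                break
        return (matched, -w.load)

    best = max(workers, key=score)
    best.cached.update(hashes)  # chosen worker now holds these blocks warm
    best.load += 1
    return best
```

In practice, a follow-up request that extends an earlier prompt lands on the worker that already computed the shared prefix, so only the new tokens need prefill, while unrelated prompts spread to the least-loaded workers.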