
Disaggregated Inference, Part 2: Moving the KV Cache Without Stalling the Decode

Blog post from Momento

Post Details
Company: Momento
Author: Hien Luu
Word Count: 674
Language: English
Summary

Continuing the exploration of disaggregated inference, this post focuses on moving the key-value (KV) cache from the prefill stage to the decode stage without stalling decode. The core technique is layer-wise streaming: as soon as a layer's KV cache blocks are computed, they begin streaming to the decode node, overlapping transfer with the remaining prefill compute and hiding most of the visible transfer latency. This is complemented by a tiered KV-cache store spanning HBM, CPU DRAM, and SSDs, which trades capacity against access speed to relieve memory bottlenecks. The post then compares strategies for managing the cache handoff: DistServe's pull model, Mooncake's push model, and shared storage as a balanced middle ground. Production evidence comes from Perplexity's KV Messenger, which uses RDMA for efficient synchronization and memory management and delivers a significant throughput increase. Together, these techniques recast the challenge from a compute-intensive problem into a cache-optimization task, enabling systems such as Moonshot AI's Kimi to serve substantially more requests and showing how far these techniques can move the needle for AI workloads.
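The overlap behind layer-wise streaming can be sketched with a producer/consumer pair: the prefill side enqueues each layer's KV block the moment it is computed, while the decode side ingests blocks concurrently instead of waiting for the full cache. This is a minimal illustration only; the layer count, block naming, and in-process queue are assumptions standing in for real tensors and an RDMA transport, not any system's actual API.

```python
import queue
import threading

NUM_LAYERS = 4  # hypothetical model depth for illustration


def prefill(send_q: queue.Queue) -> None:
    # Compute KV blocks layer by layer; stream each block as soon
    # as it is ready, so transfer overlaps the next layer's compute.
    for layer in range(NUM_LAYERS):
        kv_block = f"kv-layer-{layer}"  # stand-in for the real KV tensor
        send_q.put(kv_block)
    send_q.put(None)  # sentinel: prefill finished


def decode(send_q: queue.Queue, received: list) -> None:
    # Decode node ingests blocks as they arrive rather than
    # stalling until the entire KV cache has landed.
    while (block := send_q.get()) is not None:
        received.append(block)


send_q: queue.Queue = queue.Queue()
received: list = []
consumer = threading.Thread(target=decode, args=(send_q, received))
consumer.start()
prefill(send_q)
consumer.join()
```

Because the queue preserves FIFO order, the decode side sees layers in the order they were produced, mirroring how a real transport would deliver per-layer blocks while prefill is still running.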
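The tiered HBM/DRAM/SSD store described above can likewise be sketched as a lookup that probes the fastest tier first and promotes hits from slower tiers. The tier names, the promote-on-hit policy, and the dict-backed tiers are illustrative assumptions, not the actual design of Mooncake or any system named in the post.

```python
class TieredKVCache:
    """Toy tiered KV-cache: fastest tier first, promote on hit."""

    def __init__(self):
        # Ordered fastest -> slowest; each tier maps block_id -> kv bytes.
        self.tiers = {"hbm": {}, "dram": {}, "ssd": {}}

    def put(self, block_id: str, kv: bytes, tier: str = "hbm") -> None:
        self.tiers[tier][block_id] = kv

    def get(self, block_id: str):
        # Probe HBM, then DRAM, then SSD; report which tier served the hit.
        for name, tier in self.tiers.items():
            if block_id in tier:
                kv = tier[block_id]
                # Promote hot blocks toward HBM so repeat hits get faster.
                if name != "hbm":
                    self.tiers["hbm"][block_id] = kv
                return kv, name
        return None, None


cache = TieredKVCache()
cache.put("blk-0", b"kv-bytes", tier="ssd")
kv, first_tier = cache.get("blk-0")    # first hit served from SSD
_, second_tier = cache.get("blk-0")    # promoted copy now served from HBM
```

The promote-on-hit policy is one simple way to keep hot KV blocks in the fastest memory; real systems add eviction and capacity accounting per tier, which this sketch omits.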