
Disaggregated Inference, Part 2: Moving the KV Cache Without Stalling the Decode

Blog post from Momento

Post Details
Company: Momento
Author: Hien Luu
Word Count: 674
Language: English
Summary

Continuing the exploration of disaggregated inference, this post focuses on moving the key-value (KV) cache from the prefill stage to the decode stage without stalling decode. The core technique is layer-wise streaming: as soon as a layer's KV cache blocks are computed, they begin streaming to the decode node, overlapping transfer with the remaining prefill compute and hiding most of the visible transfer latency. This is complemented by a tiered KV-cache store spanning HBM, CPU DRAM, and SSDs, which trades capacity against access speed to relieve memory bottlenecks. The post then compares strategies for managing the cache handoff: DistServe's pull model, Mooncake's push model, and shared storage as a balanced middle ground. Production evidence comes from Perplexity's KV Messenger, which uses RDMA for efficient synchronization and memory management and delivers a significant throughput increase. Together, these techniques recast the challenge from a compute-intensive problem into a cache-optimization task, enabling systems such as Moonshot AI's Kimi to serve substantially more requests and showing how far these techniques can move the needle for AI workloads.
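The overlap behind layer-wise streaming can be sketched with a producer/consumer pair: the prefill side enqueues each layer's KV block the moment it is computed, while the decode side ingests blocks concurrently instead of waiting for the full cache. This is a minimal illustration only; the layer count, block naming, and in-process queue are assumptions standing in for real tensors and an RDMA transport, not any system's actual API.

```python
import queue
import threading

NUM_LAYERS = 4  # hypothetical model depth for illustration


def prefill(send_q: queue.Queue) -> None:
    # Compute KV blocks layer by layer; stream each block as soon
    # as it is ready, so transfer overlaps the next layer's compute.
    for layer in range(NUM_LAYERS):
        kv_block = f"kv-layer-{layer}"  # stand-in for the real KV tensor
        send_q.put(kv_block)
    send_q.put(None)  # sentinel: prefill finished


def decode(send_q: queue.Queue, received: list) -> None:
    # Decode node ingests blocks as they arrive rather than
    # stalling until the entire KV cache has landed.
    while (block := send_q.get()) is not None:
        received.append(block)


send_q: queue.Queue = queue.Queue()
received: list = []
consumer = threading.Thread(target=decode, args=(send_q, received))
consumer.start()
prefill(send_q)
consumer.join()
```

Because the queue preserves FIFO order, the decode side sees layers in the order they were produced, mirroring how a real transport would deliver per-layer blocks while prefill is still running.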
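The tiered HBM/DRAM/SSD store described above can likewise be sketched as a lookup that probes the fastest tier first and promotes hits from slower tiers. The tier names, the promote-on-hit policy, and the dict-backed tiers are illustrative assumptions, not the actual design of Mooncake or any system named in the post.

```python
class TieredKVCache:
    """Toy tiered KV-cache: fastest tier first, promote on hit."""

    def __init__(self):
        # Ordered fastest -> slowest; each tier maps block_id -> kv bytes.
        self.tiers = {"hbm": {}, "dram": {}, "ssd": {}}

    def put(self, block_id: str, kv: bytes, tier: str = "hbm") -> None:
        self.tiers[tier][block_id] = kv

    def get(self, block_id: str):
        # Probe HBM, then DRAM, then SSD; report which tier served the hit.
        for name, tier in self.tiers.items():
            if block_id in tier:
                kv = tier[block_id]
                # Promote hot blocks toward HBM so repeat hits get faster.
                if name != "hbm":
                    self.tiers["hbm"][block_id] = kv
                return kv, name
        return None, None


cache = TieredKVCache()
cache.put("blk-0", b"kv-bytes", tier="ssd")
kv, first_tier = cache.get("blk-0")    # first hit served from SSD
_, second_tier = cache.get("blk-0")    # promoted copy now served from HBM
```

The promote-on-hit policy is one simple way to keep hot KV blocks in the fastest memory; real systems add eviction and capacity accounting per tier, which this sketch omits.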