Reduce TTFT by >50% with LMCache + Momento Accelerator
Blog post from Momento
Integrating LMCache with Momento Accelerator cuts cold-start time to first token (TTFT) by more than 50% in large-scale inference clusters by offloading key-value (KV) caches to remote storage such as Valkey and S3. Momento, which provides hyperscale caching and routing for companies such as Snap and Coinbase, developed the Momento Accelerator for AI (MAX AI), which is compatible with common inference frameworks such as vLLM and SGLang. The approach uses a distributed KV cache offloading system whose high-bandwidth transfer engine connects peer inference workers and storage nodes. This effectively extends GPU memory across multiple tiers of local and remote storage, so previously computed KV entries are reused rather than discarded. Momento Accelerator for S3 serves as a low-latency, high-throughput object store for applications such as live streaming and IoT analytics. The integration, tested with vLLM, also improves cluster management: new nodes can warm up quickly from durable storage, and the post notes numerous opportunities for further optimization in Rust.
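The tiering idea can be illustrated with a minimal Python sketch. It is not the LMCache or Momento API: `TieredKVStore`, `chunk_keys`, and `cached_prefix_chunks` are hypothetical names, the tier capacities are arbitrary, and plain in-memory dictionaries stand in for GPU memory, CPU memory, and a remote store such as Valkey or S3. The sketch assumes KV entries are keyed by a hash of the token prefix, that entries evicted from a fast tier are demoted to the next tier instead of being dropped, and that a hit in a slower tier promotes the entry back to the fastest one.

```python
import hashlib
from collections import OrderedDict


def chunk_keys(token_ids, chunk_size=4):
    """Return one key per full chunk; each key hashes the entire prefix up to that chunk."""
    running = hashlib.sha256()
    keys = []
    for start in range(0, len(token_ids) - chunk_size + 1, chunk_size):
        running.update(repr(token_ids[start:start + chunk_size]).encode())
        keys.append(running.copy().hexdigest())
    return keys


class TieredKVStore:
    """Hypothetical multi-tier store: fastest tier first, LRU eviction per tier."""

    def __init__(self, tiers):
        # tiers: list of (name, max_entries); max_entries=None means unbounded.
        self.tiers = [(name, cap, OrderedDict()) for name, cap in tiers]

    def put(self, key, value, level=0):
        _, cap, store = self.tiers[level]
        store[key] = value
        store.move_to_end(key)  # mark as most recently used
        while cap is not None and len(store) > cap:
            old_key, old_value = store.popitem(last=False)  # least recently used
            if level + 1 < len(self.tiers):
                # Demote to the next (slower) tier instead of discarding.
                self.put(old_key, old_value, level + 1)
            # Entries evicted from the last tier are dropped.

    def get(self, key):
        """Return (tier_name, value) for the tier that served the hit, or (None, None)."""
        for level, (name, _, store) in enumerate(self.tiers):
            if key in store:
                value = store[key]
                if level > 0:
                    del store[key]
                    self.put(key, value, 0)  # promote to the fastest tier
                else:
                    store.move_to_end(key)
                return name, value
        return None, None


def cached_prefix_chunks(store, keys):
    """Count how many leading chunks of a prompt are already cached in any tier."""
    hits = 0
    for key in keys:
        tier, _ = store.get(key)
        if tier is None:
            break
        hits += 1
    return hits


if __name__ == "__main__":
    store = TieredKVStore([("gpu", 2), ("cpu", 2), ("remote", None)])
    keys = chunk_keys(list(range(20)))
    for i, key in enumerate(keys):
        store.put(key, f"kv-chunk-{i}")  # placeholder for real KV tensors
    print(store.get(keys[0]))  # served from a slower tier, then promoted
    print(cached_prefix_chunks(store, keys))
```

In a real deployment the slowest tier would be a network call rather than a dictionary, and the number of leading chunks found determines how much of the prompt can skip prefill, which is where the TTFT saving comes from.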