
Cache-aware disaggregated inference for long-context LLM serving

Blog post from Together AI

Post Details
Authors: Jiejing Zhang, Yubo Wang, Yinghui Liu, Mourya Vangala Srinivasa, Chenxi Li, Jue Wang, Yineng Zhang, Shuaiwen Leon Song, Ce Zhang
Word count: 1,975
Language: English
Summary

For AI workloads that demand long context lengths, such as multi-turn conversations and coding copilots, a technique called cache-aware prefill–decode disaggregation (CPD) improves the efficiency of inference systems. Traditional serving systems exhibit highly variable time-to-first-token (TTFT) when new (cold) requests and previously encountered (warm) requests compete for shared capacity. CPD addresses this with a three-tiered design that separates heavy prefill computation from context reuse, improving hardware utilization and reducing latency through efficient cache management. By distinguishing requests with high context reuse from those with little or none, CPD allocates resources so that cold requests cannot monopolize capacity and warm requests are processed swiftly. In the authors' evaluations, CPD improves throughput by up to 40% and sustains lower latency under high load than conventional designs, making it a significant advance for long-context AI serving.
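The core routing idea described above, separating warm (high cache reuse) from cold (low reuse) requests, can be sketched in a few lines. This is a minimal illustrative sketch, not Together AI's actual implementation: the `Request` type, pool names, and the 0.5 reuse threshold are all assumptions made for the example.

```python
# Hypothetical sketch of cache-aware request routing: requests whose prompt
# largely hits the KV cache ("warm") are kept on the pool holding that cached
# context, while "cold" requests with little reuse go to a separate prefill
# pool so they cannot monopolize warm capacity.
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int         # total tokens in the incoming prompt
    cached_prefix_tokens: int  # prompt tokens already present in the KV cache


def reuse_ratio(req: Request) -> float:
    """Fraction of the prompt whose KV cache can be reused."""
    if req.prompt_tokens == 0:
        return 0.0
    return req.cached_prefix_tokens / req.prompt_tokens


def route(req: Request, warm_threshold: float = 0.5) -> str:
    """Send high-reuse requests to the warm pool, the rest to cold prefill."""
    if reuse_ratio(req) >= warm_threshold:
        return "warm_pool"
    return "cold_prefill_pool"


# A long cold prompt is kept off the warm pool:
print(route(Request(prompt_tokens=32_000, cached_prefix_tokens=0)))
# A follow-up turn that reuses most of its context stays warm:
print(route(Request(prompt_tokens=33_000, cached_prefix_tokens=32_000)))
```

The threshold here stands in for whatever admission policy the real scheduler uses; the point is only that routing on measured cache reuse, rather than on request arrival order, is what keeps cold prefill work from inflating TTFT for warm traffic.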