Cache-aware disaggregated inference for long-context LLM serving
Blog post from Together AI
Long-context AI workloads such as multi-turn conversations and coding copilots are increasingly common, and serving them efficiently is hard: large shared prompts cause time-to-first-token (TTFT) to vary widely, especially when new (cold) requests mix with previously encountered (warm) requests whose context is already cached. Cache-aware prefill–decode disaggregation (CPD) addresses this with a three-tiered design that separates compute-heavy prefill from context reuse, improving hardware utilization and reducing latency through efficient cache management.

By distinguishing requests with high context reuse from those with low reuse, CPD allocates resources so that cold requests cannot monopolize capacity and warm requests are served promptly. In evaluations, CPD improves throughput by up to 40% and sustains lower latency under high load than conventional serving systems, a significant advance for long-context inference.
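To make the cold/warm distinction concrete, here is a minimal sketch of a cache-aware router. All names (`CacheAwareRouter`, `warm_pool`, `cold_pool`, the block size, and the threshold) are illustrative assumptions, not Together AI's actual implementation: it estimates how much of a request's prompt prefix is already cached and routes low-reuse (cold) requests to a prefill-heavy pool while high-reuse (warm) requests go to a cache-hit pool.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: list  # token IDs of the incoming prompt


class CacheAwareRouter:
    """Illustrative sketch only; names and thresholds are assumptions.

    Warm requests (high prefix-cache hit ratio) go to a cache-hit pool;
    cold requests go to a prefill-heavy pool, so long cold prefills do
    not block requests that can reuse cached context.
    """

    BLOCK = 16  # coarse granularity for prefix matching (assumption)

    def __init__(self, warm_threshold: float = 0.5):
        self.warm_threshold = warm_threshold
        self.cached_prefixes: set = set()  # stand-in for a real KV-cache index

    def cache_hit_ratio(self, req: Request) -> float:
        # Find the longest cached prefix, checked in fixed-size blocks.
        hit = 0
        for end in range(self.BLOCK, len(req.prompt_tokens) + 1, self.BLOCK):
            if tuple(req.prompt_tokens[:end]) in self.cached_prefixes:
                hit = end
        return hit / max(len(req.prompt_tokens), 1)

    def route(self, req: Request) -> str:
        ratio = self.cache_hit_ratio(req)
        return "warm_pool" if ratio >= self.warm_threshold else "cold_pool"

    def admit(self, req: Request) -> None:
        # After prefill completes, record the prompt's prefixes as cached.
        for end in range(self.BLOCK, len(req.prompt_tokens) + 1, self.BLOCK):
            self.cached_prefixes.add(tuple(req.prompt_tokens[:end]))
```

A first-time prompt routes to the cold pool; once its prefill has been admitted to the cache, a follow-up turn sharing that prefix routes to the warm pool. A production system would index KV-cache blocks per GPU worker rather than a single in-memory set, but the routing decision follows the same shape.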