The Concurrency Cliff is a Memory Limit
Blog post from Momento
This post examines the "concurrency cliff": a sharp rise in latency and drop in throughput that occurs once the number of concurrent sessions exceeds what the KV cache can hold, so that sessions must completely re-prefill their data. The effect is tested at three KV cache precisions (fp16, fp8, and turboquant 4-bit), and the results place the tipping point between 12 and 14 concurrent sessions. At 12 sessions, all session data fits in the cache, so latency stays low and throughput stays high. At 14 sessions, latency jumps and throughput falls because sessions are forced to re-prefill. Beyond the cliff, median latency (p50) can still look manageable, which makes tail latency (p99) the more important measure: it captures the long delays seen by requests whose session data is not cached. The post therefore recommends a safe operating point of 6 concurrent sessions to meet a 2-second p99 SLA, and stresses the importance of managing the KV cache under load to prevent performance degradation.
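
To make the two ideas above concrete, here is a minimal Python sketch. It is not taken from the post. The model shape (32 layers, 8 KV heads, head dimension 128), the 16 GiB cache budget, the 8192-token session length, and the latency samples are all hypothetical values chosen for easy arithmetic, and the function names are illustrative. It also assumes the standard KV sizing rule (one key and one value vector per layer per token) and ignores any per-block scale overhead that a real 4-bit scheme may add.

```python
import math


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bits_per_value):
    """Bytes of KV cache one token occupies (keys plus values, all layers)."""
    # Factor of 2: each layer stores one key vector and one value vector.
    return 2 * num_layers * num_kv_heads * head_dim * bits_per_value // 8


def max_resident_sessions(cache_budget_bytes, tokens_per_session, bytes_per_token):
    """How many whole sessions fit in the cache before eviction starts."""
    return cache_budget_bytes // (tokens_per_session * bytes_per_token)


def percentile(samples, q):
    """Nearest-rank percentile for an integer q in 1..100."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical deployment, NOT the configuration measured in the post.
    budget = 16 * 2**30          # 16 GiB reserved for KV cache
    tokens_per_session = 8192    # context held per session

    for name, bits in [("fp16", 16), ("fp8", 8), ("4-bit", 4)]:
        per_token = kv_bytes_per_token(32, 8, 128, bits)
        fit = max_resident_sessions(budget, tokens_per_session, per_token)
        print(f"{name}: {per_token} bytes/token, {fit} sessions fit")

    # Hypothetical latencies past the cliff: 90 cache hits, 10 re-prefills.
    latencies = [0.3] * 90 + [8.0] * 10
    print("p50:", percentile(latencies, 50), "s")
    print("p99:", percentile(latencies, 99), "s")
```

Under these assumptions, halving the bits per value doubles the number of sessions that stay resident, which moves the cliff to a higher concurrency without removing it. The latency sample shows why the median can hide the problem: with only a tenth of requests re-prefilling, p50 still reports the cache-hit latency while p99 reports the re-prefill latency.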