
The Concurrency Cliff is a Memory Limit

Blog post from Momento

Post Details
Company: Momento
Author: Khawaja Shams
Word Count: 649
Language: English
Summary

The post describes the "concurrency cliff": a sharp rise in latency and drop in throughput that occurs when the number of concurrent sessions exceeds what the KV cache can hold, forcing sessions to fully re-prefill their data. The author tests three KV cache precisions (fp16, fp8, and turboquant 4-bit), and the results place the tipping point between 12 and 14 concurrent sessions. At 12 sessions, all session data fits in the cache, so latency stays low and throughput stays high. At 14 sessions, latency jumps and throughput falls because some sessions must re-prefill their data. Beyond the cliff, median latency (p50) can still look manageable, so tail latency (p99) is the more informative measure: it captures the long delays that requests face when their session data is no longer cached. The post recommends a safe operating point of 6 concurrent sessions to meet a 2-second p99 SLA and stresses that the KV cache must be managed under load to prevent this degradation.
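To make the p50 versus p99 point concrete, here is a minimal toy simulation, not the benchmark from the post. It assumes an LRU-evicted cache that holds a fixed number of sessions, requests that pick a session uniformly at random, and two hypothetical fixed latencies (`hit_ms` for a cached session, `miss_ms` for a full re-prefill). The post does not state the eviction policy, the arrival pattern, or these latency values; all three are illustrative assumptions, as are the function names.

```python
import math
import random
from collections import OrderedDict


def simulate(sessions, capacity, requests=5000, hit_ms=100.0, miss_ms=3000.0, seed=0):
    """Return per-request latencies for a toy LRU cache of session KV state.

    Assumptions (not from the post): LRU eviction, uniformly random session
    selection, and fixed hypothetical hit/miss latencies.
    """
    rng = random.Random(seed)
    cache = OrderedDict()  # session id -> placeholder for cached KV state
    latencies = []
    for _ in range(requests):
        sid = rng.randrange(sessions)
        if sid in cache:
            cache.move_to_end(sid)  # mark as most recently used
            latencies.append(hit_ms)
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used session
            cache[sid] = True
            latencies.append(miss_ms)  # a miss stands in for a full re-prefill
    return latencies


def percentile(values, q):
    """Nearest-rank percentile for an integer q in 1..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    for n in (12, 14):
        lat = simulate(sessions=n, capacity=12)
        print(n, "sessions: p50 =", percentile(lat, 50), "ms, p99 =", percentile(lat, 99), "ms")
```

In this toy model, 12 sessions against a capacity of 12 miss only on first touch, so both percentiles sit at the hit latency. With 14 sessions, most requests still hit, so p50 is unchanged, but enough requests miss that p99 lands on the re-prefill latency. That is the pattern the summary describes: the median hides the cliff and the tail exposes it.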