Your KV Cache Benchmark Is “hi hi hi”
Blog post from Momento
Khawaja Shams examines the limits of benchmarking Key-Value (KV) cache offloading for inference with synthetic data, specifically repetitive "hi hi hi" sequences. Such inputs produce overly optimistic compression ratios and do not reflect real-world workloads, where diverse, structurally complex documents such as medical and legal texts compress and cache differently. Because the activation patterns, token diversity, and value distributions of real documents differ significantly from those of synthetic sequences, benchmarks built on synthetic input fail to capture performance under realistic load. The author argues for benchmarking with realistic, reproducible corpora that represent the conditions KV caching systems encounter in practice. The post supports this argument with experiments on the Qwen3-8B-FP8 model and discusses compression techniques such as bit shuffling and TurboQuant.
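The gap between repetitive and diverse inputs can be illustrated with a small, self-contained sketch. It is not the author's benchmark: it assumes `zlib` as a stand-in compressor, a short repeating pattern of half-precision values as a stand-in for a "hi hi hi" style cache, and seeded Gaussian noise as a rough stand-in for more diverse values. All function names are hypothetical. The `byte_shuffle` helper is a byte-level simplification of the shuffling idea, not the bit shuffling described in the post, and whether it helps depends on the data.

```python
import random
import struct
import zlib


def to_fp16_bytes(values):
    """Pack floats as little-endian IEEE 754 half-precision (2 bytes each)."""
    return struct.pack(f"<{len(values)}e", *values)


def repetitive_values(n):
    """Stand-in for a repetitive prompt: the same four values over and over."""
    pattern = [0.5, -1.25, 3.0, 0.125]
    return [pattern[i % len(pattern)] for i in range(n)]


def diverse_values(n, seed=0):
    """Assumed stand-in for diverse values: seeded Gaussian noise."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def compression_ratio(data):
    """Original size divided by zlib-compressed size (higher is better)."""
    return len(data) / len(zlib.compress(data, 6))


def byte_shuffle(data, width=2):
    """Group byte 0 of every element, then byte 1, and so on."""
    return b"".join(data[i::width] for i in range(width))


if __name__ == "__main__":
    n = 4096
    rep = to_fp16_bytes(repetitive_values(n))
    div = to_fp16_bytes(diverse_values(n))
    print(f"repetitive ratio:       {compression_ratio(rep):.2f}")
    print(f"diverse ratio:          {compression_ratio(div):.2f}")
    print(f"diverse, byte-shuffled: {compression_ratio(byte_shuffle(div)):.2f}")
```

Under these assumptions the repetitive buffer compresses far better than the diverse one, which is the effect the post warns about: a ratio measured on repeated input says little about how a real corpus will behave.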