Accurate KV Cache Quantization with Outlier Tokens Tracing
Blog post from Arize
The paper "Accurate KV Cache Quantization with Outlier Tokens Tracing (OTT)" presents a method for making Large Language Model (LLM) inference more efficient by addressing a weakness of Key-Value (KV) cache quantization. Traditional quantization methods struggle with "outlier tokens": tokens whose Key vectors have atypically small magnitudes and therefore degrade quantization accuracy. OTT dynamically identifies these outlier tokens during inference, excludes them from quantization, and keeps their full-precision representations instead. This preserves accuracy while achieving up to a 6.4× reduction in memory usage and a 2.3× increase in inference throughput under 2-bit quantization.

The method is tuning-free and compatible with existing inference engines, making it a practical option for deploying LLMs in resource-constrained environments. It is particularly effective for long-form generation and memory-constrained settings, demonstrating that intelligent token selection during quantization can improve LLM efficiency without compromising accuracy.
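To make the idea concrete, here is a minimal sketch of mixed-precision KV cache quantization with outlier-token exclusion. The selection rule (flagging the tokens with the smallest Key-vector norms) and the per-token min-max quantizer are illustrative assumptions for this sketch, not the paper's exact criterion or kernel; the function name and parameters are hypothetical.

```python
import numpy as np

def quantize_keys_with_outlier_tracing(keys, num_bits=2, outlier_ratio=0.05):
    """Quantize a Key cache per token, keeping traced outliers in full precision.

    keys: (num_tokens, head_dim) array.
    Tokens with the smallest Key-vector norms are treated as outliers
    (a placeholder heuristic) and bypass quantization entirely.
    Returns the dequantized cache and the outlier token indices.
    """
    norms = np.linalg.norm(keys, axis=1)
    num_outliers = max(1, int(outlier_ratio * len(keys)))
    # Trace the tokens with atypically small Key magnitudes as outliers.
    outlier_idx = np.argsort(norms)[:num_outliers]
    outlier_mask = np.zeros(len(keys), dtype=bool)
    outlier_mask[outlier_idx] = True

    # Per-token asymmetric min-max quantization for the remaining tokens.
    levels = 2 ** num_bits - 1
    normal = keys[~outlier_mask]
    kmin = normal.min(axis=1, keepdims=True)
    kmax = normal.max(axis=1, keepdims=True)
    scale = np.where(kmax > kmin, (kmax - kmin) / levels, 1.0)
    q = np.clip(np.round((normal - kmin) / scale), 0, levels)

    # Reassemble: dequantized values for normal tokens, originals for outliers.
    recon = keys.copy()
    recon[~outlier_mask] = q * scale + kmin
    return recon, outlier_idx
```

Because outlier tokens are stored at full precision, their attention contributions are exact, while the bulk of the cache pays only the 2-bit storage cost plus per-token scale and offset metadata.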