Accurate KV Cache Quantization with Outlier Tokens Tracing
Blog post from Arize
The paper "Accurate KV Cache Quantization with Outlier Tokens Tracing (OTT)" presents a method for making Large Language Model (LLM) inference more efficient by addressing a weakness of Key-Value (KV) cache quantization. Traditional quantization methods struggle with "outlier tokens": tokens whose Key vectors have atypically small magnitudes and therefore degrade quantization accuracy. OTT dynamically identifies these outlier tokens during inference, excludes them from quantization, and keeps their full-precision representations instead. This preserves accuracy while achieving up to a 6.4× reduction in memory usage and a 2.3× increase in inference throughput under 2-bit quantization.

The method is tuning-free and compatible with existing inference engines, making it a practical option for deploying LLMs in resource-constrained environments. It is particularly effective for long-form generation and memory-constrained settings, demonstrating that intelligent token selection during quantization can improve LLM efficiency without compromising accuracy.
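To make the idea concrete, here is a minimal sketch of mixed-precision KV cache quantization with outlier-token exclusion. The selection rule (flagging the tokens with the smallest Key-vector norms) and the per-token min-max quantizer are illustrative assumptions for this sketch, not the paper's exact criterion or kernel; the function name and parameters are hypothetical.

```python
import numpy as np

def quantize_keys_with_outlier_tracing(keys, num_bits=2, outlier_ratio=0.05):
    """Quantize a Key cache per token, keeping traced outliers in full precision.

    keys: (num_tokens, head_dim) array.
    Tokens with the smallest Key-vector norms are treated as outliers
    (a placeholder heuristic) and bypass quantization entirely.
    Returns the dequantized cache and the outlier token indices.
    """
    norms = np.linalg.norm(keys, axis=1)
    num_outliers = max(1, int(outlier_ratio * len(keys)))
    # Trace the tokens with atypically small Key magnitudes as outliers.
    outlier_idx = np.argsort(norms)[:num_outliers]
    outlier_mask = np.zeros(len(keys), dtype=bool)
    outlier_mask[outlier_idx] = True

    # Per-token asymmetric min-max quantization for the remaining tokens.
    levels = 2 ** num_bits - 1
    normal = keys[~outlier_mask]
    kmin = normal.min(axis=1, keepdims=True)
    kmax = normal.max(axis=1, keepdims=True)
    scale = np.where(kmax > kmin, (kmax - kmin) / levels, 1.0)
    q = np.clip(np.round((normal - kmin) / scale), 0, levels)

    # Reassemble: dequantized values for normal tokens, originals for outliers.
    recon = keys.copy()
    recon[~outlier_mask] = q * scale + kmin
    return recon, outlier_idx
```

Because outlier tokens are stored at full precision, their attention contributions are exact, while the bulk of the cache pays only the 2-bit storage cost plus per-token scale and offset metadata.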