Content Deep Dive
Even Better, Even Faster Quantized LLMs with QTIP
Blog post from Together AI
Post Details
Company: Together AI
Date Published: -
Author: Albert Tseng, Qingyao Sun, David Hou, Chris De Sa
Word Count: 3,170
Language: English
Hacker News Points: -
Summary
QTIP (Quantization with Trellises and Incoherence Processing) is a new weight-only post-training quantization method for LLMs that achieves state-of-the-art quality and inference speed. It compresses model weights with trellis-coded quantization (TCQ), significantly improving on QuIP#'s quality while running 3X faster than unquantized models. QTIP builds on the incoherence processing framework introduced by QuIP, which makes weight matrices behave like i.i.d. Gaussian sources; TCQ then achieves lower distortion on such sources than the vector quantization used by QuIP#. A bitshift trellis combined with compute-based codes enables fast decoding, making QTIP practical for memory-bound, weight-only inference settings.
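To make the bitshift-trellis idea concrete, here is a minimal decoding sketch. In a bitshift trellis, consecutive states share all but a few bits, so decoding a weight is just sliding a fixed-size window along the compressed bitstream and mapping each window (the trellis state) to a value with a compute-based code. Everything below is illustrative: the `window`/`step` parameters and the hash-plus-Box-Muller mapping are hypothetical stand-ins, not QTIP's actual codes (such as its hardware-friendly computed codes); QTIP's real decoder runs on GPU, not in Python.

```python
import hashlib
import math


def state_to_value(state: int) -> float:
    """Compute-based code (hypothetical): hash the trellis state to a
    pseudo-random value that looks like a Gaussian sample, so no large
    codebook lookup table is needed at decode time."""
    h = hashlib.sha256(state.to_bytes(4, "big")).digest()
    u1 = (int.from_bytes(h[:4], "big") + 1) / 2**32  # uniform in (0, 1]
    u2 = int.from_bytes(h[4:8], "big") / 2**32       # uniform in [0, 1)
    # Box-Muller transform: two uniforms -> one standard normal sample.
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)


def decode_bitshift_trellis(bits: str, window: int = 16, step: int = 2):
    """Decode a bitstream with a bitshift trellis: each state is a
    `window`-bit slice, and adjacent states overlap in all but `step`
    bits, so each decoded weight costs only `step` bits amortized."""
    values = []
    for i in range(0, len(bits) - window + 1, step):
        state = int(bits[i:i + window], 2)  # current trellis state
        values.append(state_to_value(state))
    return values
```

With `window=16` and `step=2`, each decoded weight consumes 2 fresh bits of the stream (a 2-bit-per-weight rate), and decoding is purely sequential shifts plus a small computation per state, which is why this structure suits memory-bound inference.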