Content Deep Dive

Unweight: how we compressed an LLM 22% without sacrificing quality

Blog post from Cloudflare

Post Details
Company: Cloudflare
Date Published:
Authors: Mari Galicer, Ivan Nikulin, and Chris Branch
Word Count: 3,094
Language: English
Hacker News Points: -
Summary

Unweight is a lossless compression system that Cloudflare built to ease the memory bandwidth bottleneck in its inference platform: smaller model weights allow faster, more efficient token generation on NVIDIA H100 GPUs. It shrinks model weights by 15-22% and, because the compression is lossless, accuracy is unchanged. Smaller weights mean less data has to travel from high-bandwidth memory (HBM) to the GPU's fast shared memory, which makes better use of the tensor cores. Unweight applies Huffman coding to the exponent byte of the model weights, and the largest size reductions come from the multilayer perceptron (MLP) weight matrices. Several execution strategies and an autotuner let it adapt to different workloads and batch sizes, balancing decompression work against computation to improve throughput. Results so far are reported for Llama 3.1-8B models, where lower VRAM usage and shorter transfer times for model distribution reduce cost and increase deployment flexibility across Cloudflare's network. The post presents weight compression as a way to improve GPU efficiency and encourages further research in the field.
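
The core idea in the summary, Huffman-coding the exponent of each weight while leaving the remaining bits untouched, can be illustrated with a small CPU-only sketch. This is not Cloudflare's implementation, which runs on the GPU. It assumes bfloat16 weights (1 sign bit, 8 exponent bits, 7 mantissa bits), uses synthetic Gaussian values in place of real model weights, and all function names are hypothetical. The size estimate also ignores the cost of storing the code table.

```python
import heapq
import random
import struct
from collections import Counter


def to_bf16_bits(x: float) -> int:
    """Truncate a float to a 16-bit bfloat16 pattern (top half of float32)."""
    return struct.unpack(">I", struct.pack(">f", x))[0] >> 16


def huffman_code(freqs: dict) -> dict:
    """Build a prefix code (symbol -> bit string) from symbol frequencies."""
    if len(freqs) == 1:  # a single symbol still needs one bit per occurrence
        return {next(iter(freqs)): "0"}
    # Heap entries: (weight, unique tie-breaker, {symbol: code so far})
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


def compress(bf16_values: list):
    """Huffman-code the 8 exponent bits; keep sign and mantissa raw."""
    exponents = [(b >> 7) & 0xFF for b in bf16_values]
    # Sign bit and 7 mantissa bits packed into one raw byte per weight.
    rest = [((b >> 15) << 7) | (b & 0x7F) for b in bf16_values]
    code = huffman_code(Counter(exponents))
    bitstream = "".join(code[e] for e in exponents)
    return code, bitstream, rest


def decompress(code: dict, bitstream: str, rest: list) -> list:
    """Invert compress(): decode exponents and reassemble the 16-bit values."""
    decode = {c: s for s, c in code.items()}
    out, current, i = [], "", 0
    for bit in bitstream:
        current += bit
        if current in decode:  # prefix code: a match is a complete codeword
            e, r = decode[current], rest[i]
            out.append(((r >> 7) << 15) | (e << 7) | (r & 0x7F))
            current, i = "", i + 1
    return out


# Synthetic stand-in for model weights (assumption: roughly Gaussian, small).
random.seed(0)
weights = [to_bf16_bits(random.gauss(0.0, 0.02)) for _ in range(20_000)]

code, bitstream, rest = compress(weights)
restored = decompress(code, bitstream, rest)

compressed_bits = len(bitstream) + 8 * len(rest)
ratio = compressed_bits / (16 * len(weights))
print(f"lossless: {restored == weights}, compressed/original: {ratio:.3f}")
```

The sketch saves space because the exponents of similarly scaled weights cluster around a few values, so their Huffman codes are much shorter than 8 bits, while the sign and mantissa bits are close to random and are stored as-is. The ratio printed for synthetic data will not match the figures reported for real models.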