Unweight: how we compressed an LLM 22% without sacrificing quality

Post Details

Company

Cloudflare

Date Published

April 17, 2026

Author

Mari Galicer, Ivan Nikulin, and Chris Branch

Word Count

3,094

Company Posts That Month

43

Language

English

Hacker News Points

-

Post removed?

No

Source URL

blog.cloudflare.com/unweight-tensor-compression

Summary

Unweight is a lossless compression system built to address the memory bandwidth bottleneck in Cloudflare's inference platform: it shrinks model weights so that token generation on NVIDIA H100 GPUs is faster and more efficient. By compressing model weights by 15-22% with no loss of accuracy, Unweight reduces the data that must move from high-bandwidth memory to the GPU's fast shared memory, which improves tensor core utilization. It applies Huffman coding to the exponent byte of each weight, and the largest size reductions come from the multilayer perceptron (MLP) weight matrices. Multiple execution strategies and an autotuner let Unweight adapt to different workloads and batch sizes, balancing decompression work against compute to improve throughput. Early results with Llama 3.1 8B models show reduced VRAM usage and shorter transfer times for model distribution, which translates into cost savings and more deployment flexibility across Cloudflare's network. The authors present the approach as a starting point for further research into improving GPU efficiency through weight compression.
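To make the exponent-byte idea concrete, here is a minimal sketch that estimates how much Huffman coding the exponent saves on synthetic weights. It is not Cloudflare's implementation. It assumes weights are stored as BF16 (1 sign bit, 8 exponent bits, 7 mantissa bits), that the sign and mantissa bits are kept uncompressed, and that weights are roughly Gaussian. The helper names (`bf16_bits`, `exponent_byte`, `huffman_code_lengths`, `estimate_ratio`) are hypothetical.

```python
import heapq
import random
import struct
from collections import Counter


def bf16_bits(x: float) -> int:
    """Truncate a float32 to its top 16 bits (assumed BF16 layout)."""
    (u32,) = struct.unpack("<I", struct.pack("<f", x))
    return u32 >> 16


def exponent_byte(bits: int) -> int:
    """Extract the 8 exponent bits that sit between the sign and the mantissa."""
    return (bits >> 7) & 0xFF


def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a Huffman code over freqs."""
    if len(freqs) == 1:
        return {next(iter(freqs)): 1}
    # Heap entries are (count, unique tiebreaker, symbols in this subtree).
    heap = [(count, i, [sym]) for i, (sym, count) in enumerate(freqs.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freqs, 0)
    tiebreak = len(heap)
    while len(heap) > 1:
        c1, _, s1 = heapq.heappop(heap)
        c2, _, s2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        for sym in s1 + s2:
            lengths[sym] += 1
        heapq.heappush(heap, (c1 + c2, tiebreak, s1 + s2))
        tiebreak += 1
    return lengths


def estimate_ratio(weights):
    """Estimated compressed size divided by original size (16 bits per weight)."""
    freqs = Counter(exponent_byte(bf16_bits(w)) for w in weights)
    lengths = huffman_code_lengths(freqs)
    exponent_bits = sum(freqs[s] * lengths[s] for s in freqs)
    raw_bits = 8 * len(weights)  # sign + 7 mantissa bits, stored as-is
    return (exponent_bits + raw_bits) / (16 * len(weights))


random.seed(0)
# Synthetic stand-in for a weight matrix; real model weights will differ.
weights = [random.gauss(0.0, 0.02) for _ in range(50_000)]
ratio = estimate_ratio(weights)
print(f"compressed size / original size: {ratio:.3f}")
```

The saving comes from the fact that trained weights cluster in a narrow range of magnitudes, so only a few exponent values occur often and get short codes. The ratio this toy prints should not be compared with the 15-22% reported for Unweight: it ignores code-table overhead, any tensors left uncompressed, and the constraints of decoding efficiently on a GPU.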

Trends Found in this Post

Trend	Post Mentions	Total Month Mentions	Posts	Companies	MoM
LLM	8	5,932	1,046	223	-2%
Vector Search	2	1,739	413	146	-27%
