Company:
Date Published:
Author: Yi Cui
Word count: 535
Language: -
Hacker News points: None

Summary

DeepSeek-OCR demonstrates that roughly 100 vision tokens can represent about 1000 text tokens with over 97% accuracy, a compression ratio of about 10×. This is possible because of fundamental differences between the two token types: a vision token encapsulates far more information, including words, layout, font style, and size, within a 64×64 pixel area, whereas a text token typically represents a single word or subword. Despite their differing information densities, both token types are mapped into the same 4096-dimensional embedding space, which is rich enough to capture semantic relationships for both. Text tokens reach this space through a discrete vocabulary lookup, while vision tokens are compressed into it directly, making the vision pathway continuous and free of any vocabulary expansion. This asymmetry is what lets a vision token pack substantially more information than a text token while occupying the same representation space.
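To make the two pathways concrete, here is a minimal NumPy sketch of the asymmetry described above. All sizes other than the 4096-dimensional space, the 64×64 patch, and the 100-vs-1000 token counts are hypothetical placeholders (the vocabulary size, the random projection weights); this is an illustration of the discrete-lookup vs. continuous-projection distinction, not DeepSeek-OCR's actual architecture.

```python
import numpy as np

D_MODEL = 4096            # shared embedding dimension (from the article)
VOCAB_SIZE = 32_000       # hypothetical vocabulary size
PATCH = 64                # one vision token covers a 64x64 pixel area

rng = np.random.default_rng(0)

# Text path: a discrete lookup table -- each token ID indexes one row.
text_embedding = rng.standard_normal((VOCAB_SIZE, D_MODEL)).astype(np.float32)

def embed_text(token_ids):
    """Discrete: token ID -> 4096-d vector via vocabulary lookup."""
    return text_embedding[token_ids]

# Vision path: a continuous projection -- pixels map straight to 4096-d,
# with no vocabulary in between.
vision_proj = (rng.standard_normal((PATCH * PATCH, D_MODEL))
               .astype(np.float32) * 0.01)

def embed_patch(patch):
    """Continuous: 64x64 pixel patch -> 4096-d vector via projection."""
    return patch.reshape(-1) @ vision_proj

text_tokens = embed_text(np.arange(1000))                  # 1000 text tokens
vision_tokens = np.stack([embed_patch(rng.standard_normal((PATCH, PATCH)))
                          for _ in range(100)])            # 100 vision tokens

# Both land in the same 4096-d space; the vision side uses 10x fewer tokens.
ratio = text_tokens.shape[0] / vision_tokens.shape[0]
print(text_tokens.shape, vision_tokens.shape, ratio)
```

Running this prints `(1000, 4096) (100, 4096) 10.0`: both token types occupy the same embedding space, but the vision side carries the same content in one tenth the token count.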