Company:
Date Published:
Author: Yi Cui
Word count: 535
Language: -
Hacker News points: None

Summary

DeepSeek-OCR demonstrates that roughly 100 vision tokens can represent about 1000 text tokens with over 97% accuracy, a compression ratio of about 10×. This is possible because of fundamental differences between the two token types: a vision token encapsulates far more information, including words, layout, font style, and size, within a 64×64 pixel area, whereas a text token typically represents a single word or subword. Despite their differing information densities, both token types are mapped into the same 4096-dimensional embedding space, which is rich enough to capture semantic relationships for both. Text tokens reach this space through a discrete vocabulary lookup, while vision tokens are compressed into it directly, making the vision pathway continuous and free of any vocabulary expansion. This asymmetry is what lets a vision token pack substantially more information than a text token while occupying the same representation space.
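To make the two pathways concrete, here is a minimal NumPy sketch of the asymmetry described above. All sizes other than the 4096-dimensional space, the 64×64 patch, and the 100-vs-1000 token counts are hypothetical placeholders (the vocabulary size, the random projection weights); this is an illustration of the discrete-lookup vs. continuous-projection distinction, not DeepSeek-OCR's actual architecture.

```python
import numpy as np

D_MODEL = 4096            # shared embedding dimension (from the article)
VOCAB_SIZE = 32_000       # hypothetical vocabulary size
PATCH = 64                # one vision token covers a 64x64 pixel area

rng = np.random.default_rng(0)

# Text path: a discrete lookup table -- each token ID indexes one row.
text_embedding = rng.standard_normal((VOCAB_SIZE, D_MODEL)).astype(np.float32)

def embed_text(token_ids):
    """Discrete: token ID -> 4096-d vector via vocabulary lookup."""
    return text_embedding[token_ids]

# Vision path: a continuous projection -- pixels map straight to 4096-d,
# with no vocabulary in between.
vision_proj = (rng.standard_normal((PATCH * PATCH, D_MODEL))
               .astype(np.float32) * 0.01)

def embed_patch(patch):
    """Continuous: 64x64 pixel patch -> 4096-d vector via projection."""
    return patch.reshape(-1) @ vision_proj

text_tokens = embed_text(np.arange(1000))                  # 1000 text tokens
vision_tokens = np.stack([embed_patch(rng.standard_normal((PATCH, PATCH)))
                          for _ in range(100)])            # 100 vision tokens

# Both land in the same 4096-d space; the vision side uses 10x fewer tokens.
ratio = text_tokens.shape[0] / vision_tokens.shape[0]
print(text_tokens.shape, vision_tokens.shape, ratio)
```

Running this prints `(1000, 4096) (100, 4096) 10.0`: both token types occupy the same embedding space, but the vision side carries the same content in one tenth the token count.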