DeepSeek-OCR Explained: Optical Compression for Scalable Long-Context and RAG Systems
Blog post from Zilliz
DeepSeek-OCR is an open-source model designed to improve long-context processing in large language models (LLMs) through a technique called Contexts Optical Compression. The approach renders pages of text as images and encodes them as visual tokens, so that a single page carrying the information of thousands of text tokens can be represented far more compactly, letting the model handle long documents more efficiently. This addresses key limitations of traditional token-based processing: high computational cost, diluted attention over very long inputs, and the loss of document structure when handling multimodal content.

Architecturally, the model pairs a DeepEncoder, which compresses document images into a small set of visual tokens, with a Mixture-of-Experts (MoE) decoder that reconstructs the text while preserving accuracy and layout. This design reduces computational load and improves processing efficiency for multilingual and multimodal documents. Beyond OCR itself, DeepSeek-OCR's adaptive context management and its potential to streamline multimodal pipelines in retrieval-augmented generation (RAG) systems point to a broader role in extending the capabilities of LLMs.
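The core payoff of optical compression is the ratio between the text tokens a page would normally cost and the visual tokens that replace them. The following sketch is purely illustrative: the function name and the token counts are assumptions, not values from the DeepSeek-OCR paper, chosen only to show how the compression ratio is computed.

```python
def optical_compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """Return how many text tokens each vision token stands in for.

    A ratio of 10.0 means the visual representation is 10x more
    compact than feeding the raw text tokens to the LLM.
    """
    if vision_tokens <= 0:
        raise ValueError("vision_tokens must be positive")
    return text_tokens / vision_tokens


# Hypothetical example: a page that would cost ~1000 text tokens,
# encoded by the vision encoder as ~100 visual tokens.
print(optical_compression_ratio(1000, 100))  # → 10.0
```

In a RAG pipeline, this ratio translates directly into context-window savings: the higher the ratio, the more document pages fit into a fixed token budget.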