NanoVDR: A 70M Text-Only Model That Retrieves Visual Documents as Well as a 2B VLM
Blog post from Hugging Face
NanoVDR is a 70-million-parameter text-only model for visual document retrieval that combines high efficiency with performance comparable to much larger vision-language models (VLMs) such as ColPali and DSE-Qwen2. It exploits the asymmetry between text queries and visual documents: a pre-trained VLM teacher indexes the documents, while a lightweight DistilBERT model encodes the queries, which is significantly faster and more storage-efficient than a traditional VLM. The query encoder is trained to map text queries into the teacher's visual embedding space, so it needs no images during training or inference. The result is a 50x to 143x reduction in query latency and a 64x reduction in index storage, while retrieval performance stays high across multiple datasets. The primary performance bottleneck turns out to be the language coverage of the training data rather than document complexity, and it can be mitigated by augmenting the training data with translated queries. NanoVDR's design demonstrates the potential of asymmetric architectures for tasks such as audio search and cross-lingual information retrieval, where queries and documents differ in modality.
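The asymmetric setup described above can be illustrated with a small, dependency-free sketch. It is not NanoVDR's actual code: the three-dimensional vectors are toy stand-ins for real embeddings, the function names are hypothetical, and the `1 - cosine` alignment objective is an assumption chosen for illustration, since the summary does not specify the training loss.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def alignment_loss(student_vec, teacher_vec):
    """Assumed objective: pull the student's query embedding toward the
    teacher's embedding of the same text query (0 means perfectly aligned)."""
    return 1.0 - cosine(student_vec, teacher_vec)

def retrieve(query_vec, doc_index, k=2):
    """Rank pre-indexed documents by cosine similarity to the query."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Offline step (toy data): the teacher VLM would embed page images once.
doc_index = {
    "invoice_page": [0.9, 0.1, 0.0],
    "chart_page": [0.1, 0.9, 0.1],
    "contract_page": [0.0, 0.2, 0.9],
}

# Hypothetical embeddings of one text query from teacher and student.
teacher_query = [0.2, 0.95, 0.05]
student_query = [0.25, 0.9, 0.1]

loss = alignment_loss(student_query, teacher_query)
top = retrieve(student_query, doc_index, k=2)
print(f"alignment loss: {loss:.4f}")
print("top documents:", [doc_id for doc_id, _ in top])
```

Because only text passes through the student, the query path never touches an image; the document vectors are computed once by the teacher and reused for every query.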