NanoVDR: A 70M Text-Only Model That Retrieves Visual Documents as Well as a 2B VLM

Post Details

Company

Hugging Face

Date Published

March 16, 2026

Author

Zhuchenyang Liu

Word Count

1,493

Company Posts That Month

63

Language

-

Hacker News Points

-

Post removed?

No

Source URL

huggingface.co/blog/Ryenhails/nanovdr

Summary

NanoVDR is a 70-million-parameter text-only model for visual document retrieval that matches the retrieval performance of much larger vision-language models (VLMs) such as ColPali and DSE-Qwen2 at a fraction of the cost. It exploits the asymmetry between text queries and visual documents: queries are encoded by a lightweight DistilBERT model, which is significantly faster and more storage-efficient than a VLM, while documents are indexed offline by a pre-trained VLM teacher. The query encoder is trained to map text queries into the teacher's visual embedding space, so the query encoder itself needs no images during training or inference. The result is 50x to 143x lower query latency and a 64x smaller index, with high retrieval performance maintained across multiple datasets. The primary performance bottleneck turns out to be the language coverage of the training data rather than document complexity, and it can be mitigated by augmenting the training data with translated queries. NanoVDR's design suggests that asymmetric architectures could also suit tasks such as audio search and cross-lingual information retrieval, where queries and documents differ in modality.
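The asymmetric setup described in the summary can be illustrated with a small sketch. It is not the NanoVDR implementation. It assumes single-vector embeddings, a cosine-alignment training objective between student and teacher query embeddings, and cosine-similarity retrieval, none of which the summary confirms. The names `alignment_loss` and `retrieve` are hypothetical, and random vectors stand in for real model outputs.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def alignment_loss(student_q, teacher_q):
    """Mean (1 - cosine) between student and teacher query embeddings.

    Assumed objective: pull the text-only student toward the point where
    the VLM teacher would place the same query.
    """
    s = l2_normalize(student_q)
    t = l2_normalize(teacher_q)
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))

def retrieve(query_emb, doc_index, k=3):
    """Return the indices and scores of the k most similar documents."""
    scores = l2_normalize(doc_index) @ l2_normalize(query_emb)
    top = np.argsort(-scores)[:k]
    return top, scores[top]

# Stand-in data: in a real system, doc_index would hold page embeddings
# computed offline by the VLM teacher.
rng = np.random.default_rng(0)
dim = 8
doc_index = rng.normal(size=(5, dim))

# Pretend the teacher places the query near document 2, and the trained
# student lands close to the teacher's query embedding.
teacher_q = doc_index[2] + 0.01 * rng.normal(size=dim)
student_q = teacher_q + 0.01 * rng.normal(size=dim)

loss = alignment_loss(student_q[None, :], teacher_q[None, :])
top_ids, top_scores = retrieve(student_q, doc_index, k=2)
```

At query time only the small text encoder and one matrix-vector product run, which is where the latency savings in such a design would come from. The teacher is needed only once, to build the document index.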

Trends Found in this Post

Trend	Post Mentions	Total Month Mentions	Posts	Companies	MoM
Vector Search	13	2,370	415	145	+7%
LLM	1	6,078	960	218	+18%
