Small Yet Mighty: Improve Accuracy In Multimodal Search and Visual Document Retrieval with Llama Nemotron RAG Models
Blog post from HuggingFace
The article explores the use of small Llama Nemotron models, specifically llama-nemotron-embed-vl-1b-v2 and llama-nemotron-rerank-vl-1b-v2, for improving multimodal search and visual document retrieval in enterprise settings. These models are designed to work with standard vector databases and can process both textual and visual data, enhancing the accuracy and relevance of search results across document types such as PDFs with charts and scanned contracts. The models use a bi-encoder architecture for embedding and a cross-encoder for reranking, both trained with contrastive learning for improved retrieval performance.

Evaluations on several datasets, including DigitalCorpora-10k and Earnings V2, demonstrate that these models offer significant improvements in retrieval accuracy, especially when text and image modalities are combined.

The article highlights practical applications of these models at organizations like Cadence, IBM, and ServiceNow, where they are used to enhance document understanding and streamline workflows. It also emphasizes the models' commercial licensing advantage, which makes them suitable for enterprise deployment without the restrictions seen in some competing models.
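The two-stage pattern described above (fast bi-encoder retrieval over a vector index, followed by a slower but more accurate cross-encoder reranking pass) can be sketched as follows. This is a minimal illustration, not the models' actual API: `embed` and `rerank_score` are hypothetical placeholders standing in for the real embedding and reranker models.

```python
import numpy as np

def embed(text: str, dim: int = 8) -> np.ndarray:
    # Hypothetical stand-in for a bi-encoder such as llama-nemotron-embed-vl-1b-v2:
    # a hash-seeded pseudo-random unit vector, NOT a real embedding.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def retrieve(query: str, docs: list[str], k: int = 3):
    # Stage 1: bi-encoder retrieval. Query and documents are embedded
    # independently, so document vectors can be precomputed once and
    # stored in a standard vector database.
    q = embed(query)
    doc_vecs = np.stack([embed(d) for d in docs])
    scores = doc_vecs @ q  # cosine similarity, since all vectors are unit-norm
    top = np.argsort(scores)[::-1][:k]
    return [(docs[i], float(scores[i])) for i in top]

def rerank(query: str, candidates: list[tuple[str, float]]):
    # Stage 2: cross-encoder reranking. Each (query, document) pair is
    # scored jointly, which is more accurate than comparing independent
    # embeddings but too expensive to run over the whole corpus.
    def rerank_score(q: str, d: str) -> float:
        # Placeholder scorer (term overlap); the real model would score
        # the concatenated query-document pair with a neural network.
        q_terms, d_terms = set(q.lower().split()), set(d.lower().split())
        return len(q_terms & d_terms) / max(len(q_terms), 1)
    return sorted(candidates, key=lambda c: rerank_score(query, c[0]), reverse=True)
```

A typical usage would retrieve a generous candidate set (say, the top 25 to 100 hits) with the bi-encoder, then rerank only those candidates before passing the best few documents to the generation step of a RAG pipeline.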