voyage-multimodal-3: all-in-one embedding model for interleaved text, images, and screenshots
Blog post from Voyage AI
Voyage-multimodal-3 is a new state-of-the-art model for multimodal embeddings, offering significant gains in retrieval accuracy for documents that mix text and visuals. It vectorizes interleaved text and images, capturing essential visual features from sources like PDFs, slides, and tables without the need for complex document parsing.

The model outperforms alternatives such as OpenAI's CLIP and Cohere's multimodal v3 by significant margins across a range of tasks, including table/figure retrieval, document screenshot retrieval, and text-to-photo retrieval.

Voyage-multimodal-3 processes all input modalities through the same transformer encoder, producing a unified representation that preserves the contextual relationship between textual and visual information. Its architecture, resembling modern vision-language transformers, allows greater flexibility and accuracy in mixed-modality searches. This avoids a limitation of models that route different modalities through separate networks, where text and image vectors can end up in poorly aligned regions of the embedding space.

The model is especially robust on datasets with high proportions of screenshots, maintaining accuracy where others falter. It is available now, with the first 200 million tokens offered for free.
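To make the workflow concrete, here is a minimal sketch of embedding interleaved text and screenshots with the voyageai Python client and running a text query against them. The file paths, query string, and document contents are hypothetical, and the `multimodal_embed` call follows the signature in Voyage's SDK documentation; treat the exact parameters as an assumption if your client version differs.

```python
import numpy as np
import voyageai
from PIL import Image

vo = voyageai.Client()  # reads the VOYAGE_API_KEY environment variable

# Each document is an interleaved list of text strings and PIL images.
# Screenshots are passed in directly; no PDF parsing or OCR step is needed.
# (These file paths are hypothetical placeholders.)
documents = [
    ["Q3 revenue summary:", Image.open("slides/q3_revenue.png")],
    ["System architecture overview:", Image.open("slides/system_diagram.png")],
]

doc_result = vo.multimodal_embed(
    inputs=documents,
    model="voyage-multimodal-3",
    input_type="document",
)

# Queries go through the same encoder, so a text-only query can be
# compared directly against mixed text+image document vectors.
query_result = vo.multimodal_embed(
    inputs=[["What was revenue in Q3?"]],
    model="voyage-multimodal-3",
    input_type="query",
)

# Rank documents by cosine similarity to the query embedding.
q = np.array(query_result.embeddings[0])
docs = np.array(doc_result.embeddings)
scores = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))
best = int(np.argmax(scores))
print(f"Best match: document {best} (score {scores[best]:.3f})")
```

Because every modality shares one encoder, there is a single embedding space: the same similarity computation works whether a document is pure text, a screenshot, or an interleaved mix, which is what makes mixed-modality search straightforward here.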