voyage-multimodal-3.5: a new multimodal retrieval frontier with video support
Blog post from Voyage AI
Voyage-multimodal-3.5 is an advanced multimodal embedding model for retrieval over text, images, and video, building on its predecessor, voyage-multimodal-3. It adds explicit support for video frames while retaining a unified transformer encoder that processes visual and textual inputs together, avoiding the modality gap seen in CLIP-style dual-encoder models. Across a range of datasets, including visual-document and video retrieval tasks, it achieves higher retrieval accuracy than Cohere Embed v4 and Google Multimodal Embedding 001, while remaining competitive on standard text retrieval.

The model supports Matryoshka embeddings for flexible output dimensionality and offers multiple quantization options that minimize quality loss. It is available with token-based pricing, including free usage up to certain limits, along with tooling for embedding videos and improving retrieval pipelines.
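To make the Matryoshka and quantization properties concrete, here is a minimal sketch of the two client-side operations they enable: truncating an embedding to a smaller prefix (then re-normalizing so cosine similarity stays meaningful) and symmetric int8 quantization. The 1024-dimension vector and the specific quantization scheme are illustrative assumptions, not the model's documented defaults.

```python
import numpy as np

def truncate_matryoshka(embedding, dim):
    """Keep the first `dim` coordinates of a Matryoshka-style
    embedding and re-normalize to unit length."""
    sub = np.asarray(embedding, dtype=np.float64)[:dim]
    norm = np.linalg.norm(sub)
    return sub / norm if norm > 0 else sub

def quantize_int8(embedding):
    """Symmetric int8 quantization: scale by the max absolute value.
    A common scheme; the provider's actual method may differ."""
    emb = np.asarray(embedding, dtype=np.float64)
    scale = np.max(np.abs(emb)) / 127.0
    q = np.round(emb / scale).astype(np.int8)
    return q, scale  # keep `scale` to dequantize: q * scale

# Hypothetical full-dimension embedding (values are illustrative).
full = np.random.default_rng(0).standard_normal(1024)
full /= np.linalg.norm(full)

short = truncate_matryoshka(full, 256)   # 4x smaller index footprint
quantized, scale = quantize_int8(short)  # a further 4x via int8 storage
```

Truncation trades a small amount of retrieval accuracy for a proportionally smaller vector index; quantization stacks on top of that, which is why the two features are often used together.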