Training and Finetuning Multimodal Embedding & Reranker Models with Sentence Transformers
Blog post from Hugging Face
Tom Aarsen's blog post explains how to train and fine-tune multimodal embedding and reranker models with the Sentence Transformers library, for applications such as semantic search and retrieval-augmented generation. Its worked example fine-tunes the Qwen/Qwen3-VL-Embedding-2B model for Visual Document Retrieval (VDR), raising NDCG@10 from 0.888 to 0.947 and outperforming larger models. Training combines a model, a dataset, and loss functions: CachedMultipleNegativesRankingLoss, which uses in-batch negatives with gradient caching so that large effective batch sizes fit in memory, and MatryoshkaLoss, which trains embeddings that remain useful when truncated to smaller dimensions. The post covers model architecture, dataset preparation, and efficient training techniques, and argues that domain-specific fine-tuning can beat simply switching to a larger general-purpose model. It also introduces the Router module as an alternative way to build multimodal models and describes the evaluation metrics used to track performance. Overall, it is a detailed guide to using Sentence Transformers for multimodal tasks, including the training setup, training arguments, and results.
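The NDCG@10 figures quoted in the summary (0.888 before fine-tuning, 0.947 after) come from a ranking metric that rewards placing relevant documents near the top of the first ten results. As a rough illustration of how such a score is computed, here is a minimal sketch assuming binary relevance labels. The function names `dcg_at_k` and `ndcg_at_k` and the example document IDs are hypothetical and are not part of Sentence Transformers, whose own evaluators may differ in details.

```python
import math


def dcg_at_k(gains, k=10):
    """Discounted cumulative gain over the first k ranked gains."""
    # Rank 0 is the top result; the discount is log2(rank + 2), so the top hit is undiscounted.
    return sum(gain / math.log2(rank + 2) for rank, gain in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """NDCG@k for one query, assuming binary relevance (relevant or not)."""
    gains = [1.0 if doc_id in relevant_ids else 0.0 for doc_id in ranked_ids]
    # The ideal ranking places every relevant document first.
    ideal_gains = [1.0] * min(len(relevant_ids), k)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(gains, k) / ideal_dcg


# Hypothetical example: the single relevant page is ranked second.
score = ndcg_at_k(["page_7", "page_3", "page_9"], {"page_3"})
print(round(score, 4))
```

A reported NDCG@10 is usually the mean of this per-query score over all evaluation queries, so a move from 0.888 to 0.947 means relevant documents land closer to the top on average.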
| Trend | Post Mentions | Total Month Mentions | Posts | Companies | MoM |
|---|---|---|---|---|---|
| Vector Search | 50 | 1,739 | 413 | 146 | -27% |
| AI Model Fine-tuning | 12 | 420 | 130 | 55 | -54% |
| LLM | 3 | 5,932 | 1,046 | 223 | -2% |
| RAG | 1 | 941 | 216 | 85 | -48% |