Training and Finetuning Multimodal Embedding & Reranker Models with Sentence Transformers

Post Details

Company

Hugging Face

Date Published

April 16, 2026

Author

Tom Aarsen

Word Count

3,791

Company Posts That Month

61

Language

-

Hacker News Points

-

Source URL

huggingface.co/blog/train-multimodal-sentence-transformers

Summary

Tom Aarsen's blog post explains how to train and fine-tune multimodal embedding models with the Sentence Transformers library for applications such as semantic search and retrieval-augmented generation. Its worked example fine-tunes the Qwen/Qwen3-VL-Embedding-2B model for Visual Document Retrieval (VDR), raising NDCG@10 from 0.888 to 0.947 and outperforming larger models. Training combines a model, a dataset, and loss functions such as CachedMultipleNegativesRankingLoss and MatryoshkaLoss, the latter of which trains embeddings that remain useful when truncated to smaller dimensions. The post covers model architecture, dataset preparation, and efficient training techniques, and emphasizes the benefits of domain-specific fine-tuning over using larger general-purpose models. It also introduces the Router module as an alternative way to build multimodal models and discusses the evaluation metrics used to track model performance. Overall, the post is a guide to using Sentence Transformers for multimodal tasks, with details on training setup, arguments, and results.
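
For readers unfamiliar with the NDCG@10 figures quoted in the summary, the following is a minimal sketch of how the metric can be computed for a single query. It is not taken from the post or from Sentence Transformers: the function names `dcg_at_k` and `ndcg_at_k` are illustrative, the sketch assumes the common linear-gain formulation, and it assumes the input list holds the relevance label of every candidate in ranked order (so the ideal ordering can be derived by sorting it).

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k ranked results.

    `relevances` lists the relevance label of each result in ranked order
    (position 0 is the top hit). Linear gain is an assumption of this sketch.
    """
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal_dcg == 0:
        # No relevant items for this query: define the score as 0.0.
        return 0.0
    return dcg_at_k(relevances, k) / ideal_dcg


# Hypothetical ranking: the single relevant document appears at rank 2.
example_ranking = [0, 1, 0, 0]
print(round(ndcg_at_k(example_ranking), 3))
```

A reported score such as 0.947 would typically be the mean of this per-query value across an evaluation set; a perfect ranking scores 1.0, and the score drops as relevant documents appear lower in the top 10.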

Trends Found in this Post

Trend	Post Mentions	Total Month Mentions	Posts	Companies	MoM Change
Vector Search	50	1,739	413	146	-27%
AI Model Fine-tuning	12	420	130	55	-54%
LLM	3	5,932	1,046	223	-2%
RAG	1	941	216	85	-48%