
Training and Finetuning Multimodal Embedding & Reranker Models with Sentence Transformers

Blog post from HuggingFace

Post Details
Company: HuggingFace
Date Published:
Author: Tom Aarsen
Word Count: 3,791
Company Posts That Month: 61
Language: -
Hacker News Points: -
Summary

Tom Aarsen's blog post walks through training and fine-tuning multimodal embedding models with the Sentence Transformers library, with applications such as semantic search and retrieval-augmented generation. Its worked example fine-tunes the Qwen/Qwen3-VL-Embedding-2B model for Visual Document Retrieval (VDR), raising NDCG@10 from 0.888 to 0.947 and outperforming larger models. Training brings together a model, a dataset, and loss functions such as CachedMultipleNegativesRankingLoss and MatryoshkaLoss, the latter of which trains embeddings that remain useful when truncated to smaller dimensions. The post covers model architecture, dataset preparation, and efficient training techniques, and argues that domain-specific fine-tuning can be a better choice than a larger general-purpose model. It also introduces the Router module as an alternative way to build multimodal models and describes the evaluation metrics used to track performance. Overall, it is a detailed guide to using Sentence Transformers for multimodal tasks, covering the training setup, training arguments, and results.
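
The summary reports NDCG@10 scores without defining the metric. As a rough illustration only, the sketch below computes NDCG@k for a single query, assuming binary relevance labels, linear gain, and the usual log2 rank discount. The function names `dcg_at_k` and `ndcg_at_k` are hypothetical and are not taken from the post or from the Sentence Transformers API; the evaluator used in the post may differ in details.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k items of a ranked list."""
    # rank is 0-based, so the discount for the first item is log2(2) = 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10):
    """NDCG@k for one query.

    Assumption: ranked_relevances holds the relevance label of every
    candidate document, in the order the model ranked them.
    """
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# One relevant document ranked first versus ranked second.
best = ndcg_at_k([1, 0, 0])
second = ndcg_at_k([0, 1, 0])
```

Under these assumptions, a query whose only relevant document is ranked first scores 1.0, and the score falls as that document moves down the ranking. A reported NDCG@10 is typically the mean of this per-query value over the evaluation set.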

Trends Found in this Post
Trend                  Post Mentions   Total Month Mentions   Posts   Companies   MoM
Vector Search          50              1,739                  413     146         -27%
AI Model Fine-tuning   12              420                    130     55          -54%
LLM                    3               5,932                  1,046   223         -2%
RAG                    1               941                    216     85          -48%