How We Built a Semantic Highlight Model To Save Token Cost for RAG
Blog post from HuggingFace
A bilingual Semantic Highlight model has been developed and open-sourced to cut token costs and improve answer quality in production RAG (retrieval-augmented generation) systems. Given a query, the model highlights the semantically relevant sentences in a retrieved document, so the generator sees less irrelevant text and answers are easier to trace back to their sources. The model supports both English and Chinese.

Unlike traditional keyword-based highlighting, the model uses a 0.6B encoder-only architecture to identify sentences that semantically address a query even when they share no keywords with it. By passing only the relevant sentences to the generator, it achieves a 70-80% reduction in token cost while improving answer quality.

Existing options, such as OpenSearch's semantic highlighter and Naver's Provence/XProvence models, were found inadequate due to limited context windows, missing language support, or restrictive commercial licensing. The new model, built on BGE-M3 Reranker v2 and trained on LLM-generated data, achieves state-of-the-art results on both English and Chinese datasets. It is released under the MIT license, permitting commercial use, and provides a foundation for more cost-effective and interpretable RAG systems.
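The filtering step described above can be sketched in a few lines. This is a minimal, self-contained illustration, not the released model: the `score` function below is a placeholder keyword-overlap heuristic standing in for the actual 0.6B encoder, which scores semantic relevance and works even without keyword matches. All function names here are hypothetical.

```python
import re

def split_sentences(text):
    # Naive sentence splitter; a production system would use a proper
    # sentence tokenizer (and a CJK-aware one for Chinese).
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def score(query, sentence):
    # Placeholder relevance score based on word overlap. The real model
    # replaces this with a cross-encoder forward pass over (query, sentence).
    q = set(re.findall(r"\w+", query.lower()))
    s = set(re.findall(r"\w+", sentence.lower()))
    return len(q & s) / max(len(q), 1)

def highlight(query, document, threshold=0.3):
    # Keep only sentences whose relevance score clears the threshold;
    # everything else is dropped before the document reaches the LLM.
    return [s for s in split_sentences(document) if score(query, s) >= threshold]

doc = ("The model is released under the MIT license. "
       "Cats are popular pets worldwide. "
       "It highlights semantically relevant sentences to cut token cost.")
kept = highlight("Which license is the model released under?", doc)
print(kept)  # only the license sentence survives the filter
```

In a real pipeline the surviving sentences (here one of three, a ~66% token reduction on this toy input) are concatenated and sent to the generator in place of the full document.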