
Re-autoresearching MSMARCO BM25, on Vespa

Blog post from Vespa

Post Details
Company: Vespa
Author: Andreas Eriksen
Word Count: 2,338
Language: English
Summary

Interest in the BM25 ranking function has surged: Google searches for it are increasing, and OpenAI models frequently reference it in retrieval prompts. The renewed focus on lexical search techniques like BM25 is seen as beneficial, particularly in settings where dense embedding models struggle. In an autoresearch experiment, Doug Turnbull improved on a BM25 baseline using a Python reranker. Vespa engineers set out to replicate that experiment with their own twist, relying only on existing Vespa rank features, and report significant performance gains. By applying techniques such as aggressive stopword filtering, proximity scoring, and early field matching, their approach substantially improved retrieval performance on the MSMARCO passage-ranking benchmark, with gains that held up particularly well when generalizing to larger datasets. The experiment points to room for further optimization in lexical search through a blend of manual tuning and machine learning methods, and it underscores the enduring relevance of BM25 in information retrieval.
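
To make the techniques named above more concrete, here is a minimal Python sketch of BM25 scoring with stopword filtering, plus a simple proximity signal. It is an illustration under stated assumptions, not the code from the post and not how Vespa computes its rank features: the stopword list, the function names, the whitespace tokenizer, and the `k1`/`b` defaults are all chosen here for the example, and the IDF formula is one common variant among several.

```python
import math
from collections import Counter

# Hypothetical stopword list for illustration; the post's actual list is not shown.
STOPWORDS = {"the", "a", "an", "of", "is", "in", "to", "and", "what"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [t for t in text.lower().split() if t not in STOPWORDS]


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` against `query` with a textbook BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Fall back to 1.0 so an all-empty corpus does not divide by zero.
    avg_len = (sum(len(d) for d in tokenized) / n_docs) or 1.0
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in tokenized for term in set(d))
    query_terms = tokenize(query)

    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Non-negative IDF variant (the "+1 inside the log" form).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def min_pair_distance(query, doc):
    """Smallest token gap between two distinct query terms in `doc`.

    Returns None when fewer than two distinct query terms occur.
    A smaller gap means the matched terms sit closer together.
    """
    terms = set(tokenize(query))
    last_pos = {}  # term -> most recent position seen so far
    best = None
    for pos, tok in enumerate(tokenize(doc)):
        if tok not in terms:
            continue
        for other, other_pos in last_pos.items():
            if other != tok:
                gap = pos - other_pos
                if best is None or gap < best:
                    best = gap
        last_pos[tok] = pos
    return best


if __name__ == "__main__":
    docs = [
        "what is the capital of france",
        "paris is the capital city of france",
        "the history of cheese",
    ]
    query = "capital of france"
    for doc, score in zip(docs, bm25_scores(query, docs)):
        print(f"{score:.3f}  gap={min_pair_distance(query, doc)}  {doc}")
```

A reranker in this spirit could combine the BM25 score with the proximity gap, for example by boosting documents whose matched terms are adjacent. How to weight such signals is exactly the kind of choice that manual tuning or a learned model would have to settle.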