BM42: New Baseline for Hybrid Search
Blog post from Qdrant
BM42 is introduced as an evolution of lexical search that combines the strengths of the traditional BM25 algorithm with the capabilities of modern transformer models, while addressing limitations of existing learned sparse approaches such as SPLADE. BM25 has long been a cornerstone of search because of its effective term-importance calculation, but the rise of dense embeddings and hybrid search has exposed its limits, particularly in Retrieval-Augmented Generation (RAG), where documents are short chunks and term-frequency statistics carry little signal. BM42 retains the inverse document frequency (IDF) component of BM25 but replaces the term-frequency component with a term-importance weight derived from a transformer's attention matrix, so that document scoring reflects how much each token contributes to the meaning of the text. Because transformers operate on subword tokens, BM42 retokenizes after extracting the attention weights, merging subword pieces back into whole words, which improves accuracy without extensive re-training. Although BM42 shows improvements in areas such as query inference speed and memory footprint, the post acknowledges that BM25 remains relevant for larger documents and emphasizes the benefits of hybrid models that combine sparse and dense embeddings to optimize search results across different contexts.
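The scoring idea described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Qdrant's implementation: it assumes the score is the sum, over query terms, of IDF multiplied by the document-side attention weight for that term, and that the retokenization step merges WordPiece-style pieces (marked with a `##` prefix) into whole words by summing their weights. The function names (`merge_subwords`, `idf`, `bm42_score`) and the attention values are hypothetical; in practice the weights would come from a transformer's attention matrix rather than being written by hand.

```python
import math
from collections import defaultdict


def merge_subwords(tokens, weights):
    """Merge WordPiece pieces ('##' prefix) into whole words, summing weights.

    Assumption: continuation pieces start with '##', as in BERT-style vocabularies.
    """
    words, totals = [], []
    for tok, w in zip(tokens, weights):
        if tok.startswith("##") and words:
            # Continuation piece: glue onto the previous word and add its weight.
            words[-1] += tok[2:]
            totals[-1] += w
        else:
            words.append(tok)
            totals.append(w)
    # A word may occur several times in a document; accumulate its weight.
    merged = defaultdict(float)
    for word, w in zip(words, totals):
        merged[word] += w
    return dict(merged)


def idf(term, doc_count, doc_freq):
    """BM25-style inverse document frequency."""
    n = doc_freq.get(term, 0)
    return math.log((doc_count - n + 0.5) / (n + 0.5) + 1.0)


def bm42_score(query_terms, doc_weights, doc_count, doc_freq):
    """Sum of IDF(term) * attention weight over the query terms found in the document."""
    return sum(
        idf(t, doc_count, doc_freq) * doc_weights.get(t, 0.0)
        for t in query_terms
    )


# Hand-written toy attention weights (hypothetical values, not model output).
tokens = ["vector", "data", "##base", "search"]
attention = [0.125, 0.25, 0.25, 0.375]

doc_weights = merge_subwords(tokens, attention)
score = bm42_score(["database"], doc_weights, doc_count=10, doc_freq={"database": 2})
print(doc_weights, score)
```

Keeping the IDF term separate from the per-document weight mirrors the split the summary describes: collection-level statistics stay as in BM25, while only the within-document importance comes from the model.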