Practical BM25 - Part 1: How Shards Affect Relevance Scoring in Elasticsearch

Post Details

Company

Elastic

Date Published

April 19, 2018

Word Count

1,067

Source URL

www.elastic.co/blog/practical-bm25-part-1-how-shards-affect-relevance-scoring-in-elasticsearch

Summary

This blog post is the first in a three-part series on the practical use of the BM25 similarity algorithm in Elasticsearch. It examines how the number of shards influences relevance scoring in text searches. When the post was published, Elasticsearch created five primary shards per index by default (the default became one shard in version 7.0). The shard count affects scoring because each shard holds only a subset of the documents and, by default, computes scores from its own local term statistics, such as how many of its documents contain a term, so those statistics vary from shard to shard. For example, when searching for documents with the title "Shane," the way the documents are distributed across shards produces different relevance scores, even for similar entries. To get more consistent scoring, users can reduce the number of shards or add the `search_type=dfs_query_then_fetch` request parameter, which first gathers the distributed term frequencies from all shards and then calculates scores from the combined values. This option is not enabled by default because the extra step adds processing time, so it is usually not worth the cost when speed matters more than scoring precision. The post sets the stage for a deeper look at the BM25 algorithm and its variables in the later parts of the series.
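
The shard effect described above can be illustrated with a small Python sketch. It is not taken from the post: the shard names and document counts are invented for illustration, and only the IDF part of BM25 is modeled, using the Lucene-style formula ln(1 + (N - n + 0.5) / (n + 0.5)). The helper names `per_shard_idf` and `global_idf` are hypothetical. The global calculation only approximates what `dfs_query_then_fetch` does when it combines term statistics from all shards before scoring.

```python
import math

def bm25_idf(doc_count: int, doc_freq: int) -> float:
    """Lucene-style BM25 IDF: ln(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

# Hypothetical shards (assumption, not data from the post):
# shard name -> (documents on the shard, documents containing the term)
SHARDS = {
    "shard_0": (10, 1),  # the term is rare on this shard
    "shard_1": (10, 5),  # the term is common on this shard
}

def per_shard_idf(shards: dict) -> dict:
    """Default behaviour: each shard uses only its own local statistics."""
    return {name: bm25_idf(n_docs, df) for name, (n_docs, df) in shards.items()}

def global_idf(shards: dict) -> float:
    """Rough model of dfs_query_then_fetch: sum the statistics first."""
    total_docs = sum(n_docs for n_docs, _ in shards.values())
    total_df = sum(df for _, df in shards.values())
    return bm25_idf(total_docs, total_df)

if __name__ == "__main__":
    for name, idf in per_shard_idf(SHARDS).items():
        print(f"{name}: local idf = {idf:.4f}")
    print(f"all shards combined: idf = {global_idf(SHARDS):.4f}")
```

In this sketch the same term gets a higher IDF on the shard where it is rare than on the shard where it is common, so otherwise identical documents would score differently depending on where they live. With the combined statistics, every shard uses the same IDF value.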