Company
Date Published
Author
Shane Connelly
Word count
938
Language
English
Hacker News points
None

Summary

Stop Stopping: Stopwords have been used in keyword search systems for decades, but they are an unreliable signal in semantic search contexts, and relying on them diminishes search relevance and system performance. Search engines originally used stopword lists to save resources by keeping very common words out of the index, but falling disk costs and improved compression have made this a minor concern. Many modern keyword systems still use stopwords because of the "bag of words" approach, which counts word occurrences across the entire corpus to calculate relevance scores. Under that approach, a very common query term forces the engine to search through a large number of documents even when only a few are relevant, which hurts query-side performance. More modern approaches try to detect stopwords dynamically at query time and stop scoring terms that appear too saturated relative to the other terms in the query. However, words on a stopword list sometimes carry significant information, especially when they appear in proper names or borrowed words. This highlights the need to understand semantic context when evaluating stopwords. Neural retrieval systems like Vectara can understand this context and provide contextualized relevance scoring, improving the search experience by using neural retrieval throughout all query steps.
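To make the contrast between a fixed stopword list and query-time detection concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than how any particular engine (or Vectara) works: the toy corpus, the `STATIC_STOPWORDS` list, the function names, and the `ratio` threshold are all hypothetical, and the "too saturated relative to other terms" rule is simplified to a comparison of document frequencies within the query.

```python
# Toy corpus (hypothetical); a real engine would read document
# frequencies from its index instead of rescanning the documents.
DOCS = [
    "the who played a concert in london",
    "the band played the hits",
    "a review of the concert",
    "who is in the band",
]

# Hypothetical fixed stopword list of the kind used by classic keyword systems.
STATIC_STOPWORDS = {"the", "a", "of", "in", "is", "who"}


def tokenize(text):
    """Lowercase and split on whitespace."""
    return text.lower().split()


def doc_frequency(term):
    """Number of documents in DOCS that contain the term."""
    return sum(1 for doc in DOCS if term in tokenize(doc))


def strip_static(query):
    """Classic approach: drop every query term found on the fixed list."""
    return [t for t in tokenize(query) if t not in STATIC_STOPWORDS]


def strip_dynamic(query, ratio=1.5):
    """Query-time approach (simplified): drop a term only if its document
    frequency exceeds `ratio` times that of the rarest indexed query term."""
    terms = tokenize(query)
    dfs = {t: doc_frequency(t) for t in terms}
    seen = [df for df in dfs.values() if df > 0]
    if not seen:
        # No query term occurs in the corpus; nothing to compare against.
        return terms
    threshold = ratio * min(seen)
    return [t for t in terms if dfs[t] <= threshold]


if __name__ == "__main__":
    print(strip_static("The Who"))
    print(strip_dynamic("The Who"))
```

On this toy corpus the fixed list removes both words of the query "The Who" and leaves nothing to search for, while the query-time rule drops only "the" and keeps "who". Neither function knows that "The Who" is a band name, which is the gap in semantic context that the summary attributes to stopword-based approaches.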