How GitHub Docs’ new search works

Post Details

Company

GitHub

Date Published

March 9, 2023

Author

Peter Bengtsson

Word Count

2,390

Language

English

Source URL

github.blog/engineering/architecture-optimization/how-github-docs-new-search-works

Summary

GitHub Docs moved its site search from an in-memory solution to Elasticsearch because the old approach stopped scaling as the site grew: it required loading all searchable text into memory. Elasticsearch was chosen in part because it can run locally, which makes debugging easier for engineers. The new implementation sends a single query to Elasticsearch and ranks results with a combination of match clauses and boosts, and the set of clauses depends on whether the query contains one term or several. Relevance is tuned through a matrix of fields and analyzers that covers both explicit and regular matches, with different boost levels for matches in the content, the title, and the headings. A popularity metric derived from pageviews further adjusts the ranking so that frequently visited pages rank higher, and the team is still tuning the algorithm so that popular but less relevant pages do not dominate. Future directions include exploring synonyms and contextual variables to improve precision, and using user feedback to keep refining the search experience.
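
To make the ranking idea in the summary concrete, here is a minimal sketch of how such a single query could be assembled as an Elasticsearch query DSL body. It is not GitHub's actual code. The field names (`title`, `headings`, `content`, their `_explicit` variants, and `popularity`), every boost value, and the choice of `function_score` with `field_value_factor` for the popularity adjustment are assumptions made for illustration only.

```python
from typing import Any

# Hypothetical base boosts per field; the real values are not given in the summary.
FIELD_BOOSTS = {"title": 4.0, "headings": 3.0, "content": 1.0}

# Hypothetical multipliers for the analyzer variant and the match type.
EXPLICIT_FACTOR = 2.0  # "<field>_explicit" is assumed to use a stricter analyzer
PHRASE_FACTOR = 10.0   # all terms, adjacent and in order
AND_FACTOR = 2.5       # all terms, anywhere in the field
OR_FACTOR = 1.0        # any term


def build_search_query(query: str) -> dict[str, Any]:
    """Build one query body that ranks hits using boosted match clauses."""
    terms = query.split()
    if not terms:
        raise ValueError("query must contain at least one term")
    is_multi_term = len(terms) > 1

    should: list[dict[str, Any]] = []
    for field, base in FIELD_BOOSTS.items():
        # The fields x analyzers matrix: an explicit and a regular variant per field.
        for suffix, factor in (("_explicit", EXPLICIT_FACTOR), ("", 1.0)):
            name = field + suffix
            weight = base * factor
            if is_multi_term:
                # Phrase and AND matches only make sense with several terms.
                should.append({"match_phrase": {
                    name: {"query": query, "boost": weight * PHRASE_FACTOR}}})
                should.append({"match": {
                    name: {"query": query, "operator": "AND",
                           "boost": weight * AND_FACTOR}}})
            # A plain match (OR semantics) is always included.
            should.append({"match": {
                name: {"query": query, "boost": weight * OR_FACTOR}}})

    return {
        "function_score": {
            "query": {"bool": {"should": should, "minimum_should_match": 1}},
            # Assumed: "popularity" is a number in [0, 1] derived from pageviews.
            # ln2p computes ln(2 + popularity), so the multiplier stays positive.
            "field_value_factor": {
                "field": "popularity",
                "modifier": "ln2p",
                "missing": 0.0,
            },
            "boost_mode": "multiply",
        }
    }
```

The returned dictionary is meant to be the `query` part of a search request. A single-term query such as `build_search_query("actions")` yields six plain match clauses, while a multi-term query such as `build_search_query("git clone")` adds phrase and AND clauses for every field variant. Because the popularity adjustment is one multiplier applied on top of the text score, it can be tuned separately from the boosts.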