The technology behind GitHub’s new code search
Blog post from GitHub
GitHub built its new code search engine, Blackbird, from scratch in Rust because existing solutions could not handle the size and constant churn of the code hosted on GitHub. The goal was a better user experience: developers should be able to ask questions of the code and get answers by iteratively searching and navigating. Code search also has requirements that general text search does not. Punctuation must be searchable, regular expressions must be supported, and the engine must not apply stemming or strip stop words. Blackbird's architecture combines ngram indices, delta encoding, and a sharding strategy to operate at the scale of GitHub's more than 200 million repositories. The system uses Kafka for asynchronous processing and keeps query results consistent even as the underlying code changes. Content deduplication and delta indexing shrink the amount of data that has to be indexed, which keeps the index compact and queries fast at scale.
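To make the combination of ngram indexing, content deduplication, and sharding more concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not Blackbird's implementation: the class name `TinyCodeIndex`, the shard count, the use of SHA-1 as the content identifier, and the choice of trigrams (ngrams of length 3) are all hypothetical choices made for this example.

```python
import hashlib
from collections import defaultdict

NUM_SHARDS = 4  # hypothetical shard count, chosen only for this example


def trigrams(text):
    """Return every 3-character substring. Punctuation is kept, nothing is stemmed."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


class TinyCodeIndex:
    """Toy trigram index with content deduplication and hash-based sharding."""

    def __init__(self, num_shards=NUM_SHARDS):
        # One inverted index per shard: trigram -> set of content hashes.
        self.shards = [defaultdict(set) for _ in range(num_shards)]
        self.contents = {}              # content hash -> file text
        self.paths = defaultdict(set)   # content hash -> paths with that content

    def add(self, path, text):
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        self.paths[digest].add(path)
        if digest in self.contents:
            return  # identical content is already indexed; only record the path
        self.contents[digest] = text
        # Assumption: the shard is picked from the content hash, so identical
        # files always land on the same shard and are indexed once.
        shard = self.shards[int(digest, 16) % len(self.shards)]
        for gram in trigrams(text):
            shard[gram].add(digest)

    def search(self, query):
        grams = trigrams(query)
        if not grams:
            raise ValueError("query must be at least 3 characters long")
        hits = set()
        for shard in self.shards:
            # Candidates must contain every trigram of the query.
            candidates = set.intersection(*(shard.get(g, set()) for g in grams))
            for digest in candidates:
                # Trigram matches can be false positives, so confirm on the text.
                if query in self.contents[digest]:
                    hits.update(self.paths[digest])
        return sorted(hits)


index = TinyCodeIndex()
index.add("repo_a/src/main.rs", "fn main() {}")
index.add("repo_b/src/main.rs", "fn main() {}")  # e.g. a fork with identical content
index.add("repo_c/lib.py", "def limits(x): return x")
print(index.search("main()"))
```

Both `main.rs` paths are returned for the query `main()`, yet the shared content is stored and indexed only once. A regular-expression search would need an extra step that derives required ngrams from the pattern, which this sketch leaves out.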