Company
Date Published
Author
Will Farrington
Word count
1935
Language
English
Hacker News points
None

Summary

Last week, GitHub experienced a critical outage in its newly launched Code Search service. The outage was attributed to issues that followed an upgrade to elasticsearch version 0.20.2 and to the absence of a proper staging environment. No other components were affected, but GitHub deemed the outage unacceptable because of its severity and duration. The problems arose from data shards that were corrupted or missing during cluster recovery and from high CPU load on some nodes, which led to master role instability. GitHub worked with the elasticsearch developers to identify misconfigurations and bugs, and a new elasticsearch version was released with fixes. A separate, unrelated outage was caused by human error when Java and elasticsearch upgrades were merged into production. GitHub is working on improvements, including better testing environments and automation, to prevent future incidents and keep Code Search stable.