GitHub Availability Report: October 2020
Blog post from GitHub
In October, GitHub experienced a significant service disruption caused by an error during a routine upgrade of ZooKeeper nodes, which led to the accidental creation of a second ZooKeeper cluster and a corresponding second Kafka cluster. As a result, about 10% of write requests to GitHub's background job service failed. The affected traffic was rerouted to a secondary processing system, where a backlog of jobs built up, though no jobs were lost. The incident highlighted the need for improved procedures, prompting GitHub to update its ZooKeeper provisioning checklist and to plan automation for maintaining its ZooKeeper and Kafka clusters. To give readers more insight into its infrastructure improvements and engineering efforts, GitHub also introduced the "Building GitHub" blog series.
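The fallback behavior described above, where writes that fail on the primary path are rerouted to a secondary system so that no jobs are lost, can be illustrated with a minimal Python sketch. Everything in it is a hypothetical stand-in rather than GitHub's actual implementation: the names `PrimaryWriteError`, `make_flaky_primary`, `submit`, and `simulate` are invented for illustration, and the deterministic "every tenth write fails" rule is only an assumption chosen to mimic the roughly 10% failure rate.

```python
from collections import deque


class PrimaryWriteError(Exception):
    """Raised when the hypothetical primary queue rejects a write."""


def make_flaky_primary(accepted, fail_every=10):
    """Return a write function that fails on every `fail_every`-th call.

    This is an assumed stand-in for a primary job queue that rejects
    roughly 10% of writes; it is not a real client.
    """
    calls = 0

    def write(job):
        nonlocal calls
        calls += 1
        if calls % fail_every == 0:
            raise PrimaryWriteError(f"write failed for {job!r}")
        accepted.append(job)

    return write


def submit(job, primary_write, secondary):
    """Try the primary path first; on failure, queue the job on the secondary."""
    try:
        primary_write(job)
        return "primary"
    except PrimaryWriteError:
        # The job is kept rather than dropped, so the secondary backs up
        # but nothing is lost.
        secondary.append(job)
        return "secondary"


def simulate(n_jobs=100):
    """Submit `n_jobs` jobs and return (primary_count, secondary_count)."""
    accepted = []
    secondary = deque()
    primary_write = make_flaky_primary(accepted)
    for i in range(n_jobs):
        submit(f"job-{i}", primary_write, secondary)
    return len(accepted), len(secondary)
```

In this sketch the two counts returned by `simulate` always sum to the number of submitted jobs, which mirrors the report's point that the secondary system accumulated a backlog without losing work.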