GitHub Availability Report: November 2021
Blog post from GitHub
In November, GitHub experienced a significant incident that affected core services, including GitHub Actions, API Requests, and Git Operations. The cause was a novel failure mode during a schema migration on a large MySQL table. The migration's final step, a table rename, caused a semaphore deadlock across MySQL read replicas. The affected replicas entered a crash-recovery state, leaving too few active replicas to handle production requests and impairing service availability. To mitigate the impact, GitHub promoted healthy internal replicas to production, but this was not enough for full recovery because of the ongoing crash-recovery loop. GitHub then chose to prioritize data integrity over availability and removed production traffic from the faulty replicas until they could process the table rename successfully. No data corruption occurred, and write operations remained healthy throughout the incident. Going forward, GitHub plans to improve resiliency by partitioning clusters to reduce the impact of migrations and by over-provisioning clusters to absorb increased load. GitHub has paused schema migrations while it investigates the failure scenario and is improving its migration tooling to prevent similar incidents. Further updates will be published on the GitHub engineering blog.
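The trade-off described above, serving reads only from replicas that are healthy and have applied the table rename, and over-provisioning so the remaining pool still covers demand, can be illustrated with a small model. This is a minimal sketch and not GitHub's tooling: the `Replica` fields, the `serving_pool` and `has_headroom` helpers, the replica names, and all capacity figures are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    healthy: bool          # False while the replica is stuck in crash recovery
    rename_applied: bool   # True once the table rename has been processed
    capacity_qps: int      # hypothetical per-replica read capacity


def serving_pool(replicas):
    """Keep only replicas that are healthy and have applied the rename."""
    return [r for r in replicas if r.healthy and r.rename_applied]


def has_headroom(pool, demand_qps, headroom=0.0):
    """True if pool capacity covers demand plus a fractional safety margin."""
    total = sum(r.capacity_qps for r in pool)
    return total >= demand_qps * (1 + headroom)


# Hypothetical cluster: two replicas are excluded from production traffic.
replicas = [
    Replica("replica-a", healthy=True, rename_applied=True, capacity_qps=1000),
    Replica("replica-b", healthy=False, rename_applied=False, capacity_qps=1000),
    Replica("replica-c", healthy=True, rename_applied=False, capacity_qps=1000),
    Replica("replica-d", healthy=True, rename_applied=True, capacity_qps=1000),
]

pool = serving_pool(replicas)
print([r.name for r in pool])                             # ['replica-a', 'replica-d']
print(has_headroom(pool, demand_qps=1500))                # True
print(has_headroom(pool, demand_qps=1500, headroom=0.5))  # False
```

In this toy model, excluding faulty replicas protects correctness but shrinks capacity, so whether the service stays available depends on how much spare capacity was provisioned beforehand. That is the motivation for the over-provisioning plan mentioned in the report.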