
GitHub Availability Report: November 2021

Blog post from GitHub

Post Details
Company: GitHub
Date Published:
Author: Scott Sanders
Word Count: 472
Language: English
Hacker News Points: -
Summary

In November 2021, GitHub experienced a significant incident that degraded core services, including GitHub Actions, API Requests, and Git Operations. The cause was a novel failure mode during a schema migration on a large MySQL table: the migration's final step, a table rename, triggered a semaphore deadlock across MySQL read replicas. The affected replicas entered a crash-recovery state, leaving too few active replicas to handle production requests and impairing service availability.

To mitigate the impact, GitHub promoted healthy internal replicas to production, but this was not enough for full recovery because the affected replicas remained in a crash-recovery loop. GitHub then chose to prioritize data integrity over availability, removing production traffic from the faulty replicas until they could process the table rename successfully. No data corruption occurred, and write operations remained healthy throughout the incident.

Going forward, GitHub plans to improve system resiliency by partitioning clusters to reduce the impact of migrations and by over-provisioning clusters to handle increased load. It has paused schema migrations while it investigates the failure scenario and is improving its migration tooling to prevent similar incidents, with further updates available on the GitHub engineering blog.
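The capacity reasoning behind "insufficient active replicas" and over-provisioning can be illustrated with a small sketch. This is not GitHub's tooling: the `Replica` type, the function names, and all capacity and demand figures are hypothetical, chosen only to show how read headroom shrinks as replicas drop into crash recovery.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    capacity_qps: int      # hypothetical read capacity of this replica
    healthy: bool = True   # False while the replica is in crash recovery


def serving_capacity(replicas):
    """Total read capacity of the replicas still taking traffic."""
    return sum(r.capacity_qps for r in replicas if r.healthy)


def can_serve(replicas, demand_qps):
    """True if the healthy replicas alone can absorb the read demand."""
    return serving_capacity(replicas) >= demand_qps


def max_tolerable_failures(replicas, demand_qps):
    """Worst-case number of healthy replicas that can be lost
    (largest first) while the rest still cover the demand."""
    caps = sorted((r.capacity_qps for r in replicas if r.healthy), reverse=True)
    remaining = sum(caps)
    lost = 0
    for cap in caps:
        if remaining - cap < demand_qps:
            break
        remaining -= cap
        lost += 1
    return lost
```

With six hypothetical replicas of 1,000 QPS each and a read demand of 3,500 QPS, `max_tolerable_failures` returns 2: a third simultaneous failure leaves only 3,000 QPS of capacity. Over-provisioning raises that threshold, while partitioning limits how many replicas a single migration can affect at once.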