GitHub Availability Report: April 2024
Blog post from GitHub
In April, GitHub experienced four incidents that degraded performance across multiple services, with symptoms ranging from degraded functionality to high error rates.

On April 5, a change to the database load balancer caused connection failures and widespread disruption. The incident was resolved by reverting the change, and early detection measures were introduced afterward.

On April 10, two separate incidents occurred. In the first, an unbounded query overloaded a primary database, which raised error rates and impacted repository management and search. In the second, a compute-intensive query disrupted key database operations, affecting GitHub Actions, API requests, Pages deployments, and more. Both incidents were mitigated by altering the queries and enhancing resilience mechanisms.

From April 11 to 14, email delivery was delayed because of issues with a shared resource pool and an unhealthy job queue. The delays affected password resets and device verification for users without two-factor authentication. Immediate improvements were made to the detection and management of such issues, including a queue bypass for time-sensitive emails and pausing the problematic job queue to protect shared resources.
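
The summary does not describe how the queue bypass for time-sensitive emails was built. As a rough illustration of the idea only, the Python sketch below sends time-sensitive messages directly while everything else waits in a shared queue that can be paused. All names here (`Mailer`, `Email`, `TIME_SENSITIVE`, the message kinds) are hypothetical and are not taken from GitHub's systems.

```python
from collections import deque
from dataclasses import dataclass, field

# Hypothetical message kinds; the report only says that password resets
# and device verification emails were affected by the delays.
TIME_SENSITIVE = {"password_reset", "device_verification"}


@dataclass
class Email:
    kind: str
    recipient: str


@dataclass
class Mailer:
    queue: deque = field(default_factory=deque)  # shared background queue
    sent: list = field(default_factory=list)     # stands in for real delivery
    queue_paused: bool = False                   # set when the queue is unhealthy

    def deliver(self, email: Email) -> None:
        # A real system would hand the message to a mail service here.
        self.sent.append(email)

    def submit(self, email: Email) -> str:
        # Time-sensitive mail skips the queue, so a backlog cannot delay it.
        if email.kind in TIME_SENSITIVE:
            self.deliver(email)
            return "sent_directly"
        self.queue.append(email)
        return "queued"

    def drain(self) -> int:
        # Process queued mail unless the queue has been paused.
        if self.queue_paused:
            return 0
        count = 0
        while self.queue:
            self.deliver(self.queue.popleft())
            count += 1
        return count
```

With the queue paused, a call such as `Mailer(queue_paused=True).submit(Email("password_reset", "a@example.com"))` still delivers immediately, while other mail stays queued until the pause is lifted and `drain()` runs.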