Company
Date Published
Author
Jakub Oleksy
Word count
425
Language
English
Hacker News points
None

Summary

In January 2025, GitHub experienced three significant service disruptions, each with a different infrastructure cause. The company mitigated each one and is working to prevent a recurrence. The first incident, on January 9, was caused by a deployment that introduced a problematic query, which overloaded a primary database server and produced a peak error rate of 6.85%. GitHub mitigated it by rolling back the deployment and plans to improve its tooling so that such queries are detected earlier. The second disruption, on January 13, stemmed from a configuration change that affected Git operations. GitHub resolved it by reverting the change and is improving its monitoring and deployment practices. The final incident, on January 30, involved a hardware failure in the caching layer that led to a peak error rate of 44%. The incident was prolonged because the cache had no automated failover, and GitHub plans to implement a high-availability cache configuration to improve resilience.