Downtime last Saturday

Company

GitHub

Date Published

Dec. 26, 2012

Author

Mark Imbriaco

Word count

1789

Language

English

Hacker News points

None

URL

github.blog/news-insights/the-library/downtime-last-saturday

Summary

On December 22, 2012, GitHub suffered a significant outage during a scheduled maintenance window for software updates on its aggregation switches. The network vendor had recommended the updates to address earlier issues. Although prior tests had succeeded, complications involving the in-service software upgrade process and the switches' MLAG (multi-chassis link aggregation) features caused unexpected instability and network disruptions. The instability triggered failover actions in the fileserver architecture, and in several fileserver pairs both nodes attempted to become active at the same time. The resulting "split-brain" condition required a comprehensive recovery effort that lasted over five hours. To prevent a recurrence, GitHub has postponed further software upgrades until a staging environment is established, is revisiting failover timeouts with its vendor, and is adjusting its high-availability configuration and failover procedures.
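
The split-brain mechanism described in the summary can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not GitHub's actual failover software: the function name `first_split_brain`, the one-second heartbeat, and all timing values are hypothetical. It models a two-node pair in which the standby promotes itself once it has heard nothing from the primary for longer than a timeout, while the primary, which is still healthy, stays active.

```python
from typing import Optional


def first_split_brain(
    partition_start: int,
    partition_end: int,
    failover_timeout: int,
    duration: int = 60,
) -> Optional[int]:
    """Return the first second at which both nodes are active, or None.

    All parameters are hypothetical values in seconds. The network link
    between the nodes is down for partition_start <= t < partition_end.
    """
    primary_active = True    # the primary is healthy and never steps down
    standby_active = False
    last_heartbeat = 0       # last time the standby heard from the primary

    for t in range(duration + 1):
        link_up = not (partition_start <= t < partition_end)
        if link_up:
            # A heartbeat arrives every second while the link works.
            last_heartbeat = t
        elif t - last_heartbeat > failover_timeout:
            # The standby cannot tell a dead peer from a dead network,
            # so it assumes the primary failed and promotes itself.
            standby_active = True

        if primary_active and standby_active:
            return t
    return None


if __name__ == "__main__":
    # A 30-second network disruption with a 5-second timeout: split-brain.
    print(first_split_brain(10, 40, failover_timeout=5))   # prints 15
    # The same disruption with a timeout longer than the outage: no failover.
    print(first_split_brain(10, 40, failover_timeout=45))  # prints None
```

The sketch shows why timeouts alone are a trade-off: a longer timeout rides out a brief network disruption, but it also delays recovery when the primary has truly failed.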