Company
Date Published
Author
Scott Sanders
Word count
1358
Language
English
Hacker News points
None

Summary

GitHub experienced an outage lasting two hours and six minutes after a power disruption in its primary data center caused more than 25% of servers and several network devices to reboot, triggering a cascading failure. The initial response was slowed because the ChatOps systems had also rebooted and because of early confusion over whether the incident was a DDoS attack. Engineers restored service by fixing servers that failed to boot and rebuilding Redis clusters on alternate hardware, and recovery was completed without data loss. Going forward, GitHub plans to update firmware, improve dependency testing, strengthen internal communication, and communicate more clearly with users, with the aim of mitigating future incidents and improving recovery strategies.