Company
Date Published
Author
Corey Donohoe
Word count
1139
Language
English
Hacker News points
None

Summary

During the first week of January, a series of outages affected GitHub's infrastructure, primarily involving a critical fileserver, fs7, and the primary load balancer. The trouble began on Monday with the failover of a Xen machine hosting the load balancer, which exposed internal routing and DNS resolution problems caused by an expanding network. On Tuesday, fs7 failed again under high load, a failure made worse by an outdated kernel, and service was interrupted for some customers. On Wednesday, load on fs7 remained high because of a surge in git-http traffic that the existing policies did not cover, which overloaded the machine's memory. Using metrics from Librato, GitHub traced the problem to the unmanaged git-http processes and brought them under management. By Thursday, these measures were in place and operations ran more smoothly, with no further outages, allowing GitHub to refocus on improvements rather than crisis management. The post thanks the key individuals who helped resolve the issues and highlights the importance of comprehensive metrics in diagnosing complex infrastructure problems.