GitHub availability this week

Post Details

Company

GitHub

Date Published

Sept. 14, 2012

Author

Jesse Newland

Word Count

1,516

Language

English

Hacker News Points

-

Source URL

github.blog/news-insights/the-library/github-availability-this-week

Summary

GitHub suffered two significant outages early in the week, both traced to a recent database infrastructure upgrade intended to improve high availability. Together the outages lasted more than three hours, and both involved the new automated failover process: one through system overload during a database migration, the other through incorrect actions taken by the cluster management software. In the first incident, a schema migration placed excessive load on the database and triggered multiple failovers; the team temporarily disabled health checks to stabilize the system. In the second incident, a cluster partition caused the master election process to be mishandled, resulting in data drift and a temporary security exposure affecting some private repositories. In response, GitHub's operations team has switched to manual failover for critical databases, is auditing its cluster management stack, and has increased the database capacity of its status site so that it can better handle traffic during outages. The team says it is committed to refining its infrastructure to prevent future disruptions and provide a stable user experience.