GitHub Availability Report: July 2020

Post Details

Company

GitHub

Date Published

Aug. 5, 2020

Author

Keith Ballinger

Word Count

418

Language

English

Hacker News Points

-

Source URL

github.blog/news-insights/company-news/github-availability-report-july-2020

Summary

In July 2020, GitHub.com suffered a significant service disruption when Kubernetes marked production Pods as unavailable, which reduced capacity and ultimately caused downtime. The incident began when a container exceeded its memory limit and was terminated. A DNS maintenance operation compounded the problem by preventing Kubernetes from fetching new container images, so Pods failed to start. Initial mitigation efforts made the situation worse, and service was restored after falling back to cached DNS records. In response, GitHub plans to enhance monitoring, reduce its dependency on the image registry, improve validation of DNS changes, reassess its Kubernetes deployment policies, and adopt a more incremental approach to deployments as part of a broader reliability initiative.
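
As a rough illustration of what a more incremental approach to deployments can mean, the sketch below replaces Pods in small batches and halts at the first batch that fails to start, so the remaining Pods keep serving on the old version. It is a toy simulation under stated assumptions, not GitHub's tooling and not a Kubernetes API: the names `incremental_rollout`, `RolloutResult`, and `start_new_pod` are hypothetical, and the callback stands in for whatever health check a real system would use.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RolloutResult:
    updated: List[str]    # pods successfully replaced with the new version
    failed: List[str]     # pods in the halted batch that failed to start
    untouched: List[str]  # pods left on the old version, still serving


def incremental_rollout(
    pods: List[str],
    start_new_pod: Callable[[str], bool],  # hypothetical: True if the new pod became healthy
    batch_size: int = 1,
) -> RolloutResult:
    """Replace pods batch by batch and halt at the first batch with a failure."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    updated: List[str] = []
    for i in range(0, len(pods), batch_size):
        batch = pods[i:i + batch_size]
        failed = [p for p in batch if not start_new_pod(p)]
        if failed:
            # Halt here: every pod after this batch keeps running the old version.
            updated.extend(p for p in batch if p not in failed)
            return RolloutResult(updated, failed, pods[i + batch_size:])
        updated.extend(batch)
    return RolloutResult(updated, [], [])


if __name__ == "__main__":
    # Simulate an unreachable image registry: no new pod can start.
    def registry_down(_pod: str) -> bool:
        return False

    pods = [f"pod-{n}" for n in range(10)]
    result = incremental_rollout(pods, registry_down, batch_size=2)
    print(len(result.failed), "failed;", len(result.untouched), "still serving")
```

In this simulation, an image-pull failure like the one described above costs only the first batch of two Pods, and the other eight are never touched. Restarting every Pod at once under the same condition would leave none able to start.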