How a GCP Persistent Disk Incident Snowballed into a 23-Hour Outage – and Taught Us Some Important Lessons

Post Details

Company: Grafana Labs
Date Published: Jan. 24, 2020
Author: Mauro Stettler
Word Count: 1,544
Company Posts That Month: 19
Language: English
Hacker News Points: -
Post removed?: No
Source URL: grafana.com/blog/how-a-gcp-persistent-disk-incident-snowballed-into-a-23-hour-outage-and-taught-us-some-important-lessons

Summary

Grafana Labs suffered a 23-hour outage of its Grafana Cloud Graphite service after a Google Cloud Platform incident degraded the performance of GCP solid-state persistent disks. The degradation hit the Cassandra clusters, a critical part of the service's infrastructure, and set off a cascade of problems, including failed writes to Cassandra and connectivity problems with Metrictank instances. Recovery was complex: it involved updating Kubernetes StatefulSet configurations and restarting Metrictank instances, and it was complicated further by a longstanding deadlock bug in the Metrictank code. The incident led to several lessons and improvements, including fixing old bugs, improving recovery procedures, and recognizing the importance of global teamwork and effective monitoring. Despite the challenges, the team managed to limit long-term data loss, and the experience pushed them to make their systems more resilient to future incidents.

Trends Found in this Post

Trend	Post Mentions	Total Month Mentions	Posts	Companies	MoM
Kubernetes	5	728	86	30	-33%
