How a GCP Persistent Disk Incident Snowballed into a 23-Hour Outage – and Taught Us Some Important Lessons

Blog post from Grafana Labs

Post Details
Company: Grafana Labs
Author: Mauro Stettler
Word Count: 1,544
Language: English
Summary

Grafana Labs experienced a 23-hour outage of its Grafana Cloud Graphite service after a Google Cloud Platform incident degraded the performance of GCP solid-state persistent disks. The degraded disks affected the Cassandra clusters, a critical part of the infrastructure, and set off a cascade of issues, including failed write operations to Cassandra and connectivity problems with Metrictank instances. Recovery was complex: it involved updating Kubernetes StatefulSet configurations and restarting Metrictank instances, and it was compounded by a longstanding deadlock bug in the Metrictank code. The incident prompted several improvements, including fixing old bugs and enhancing recovery procedures, and it underscored the importance of global teamwork and effective monitoring. Despite the challenges, the team managed to mitigate long-term data loss, and the experience drove them to make their systems more resilient to future incidents.