How a GCP Persistent Disk Incident Snowballed into a 23-Hour Outage -- and Taught Us Some Important Lessons

Post Details

Company

Grafana Labs

Date Published

Jan. 23, 2020

Author

Mauro Stettler

Word Count

1,544

Company Posts That Month

19

Language

English

Hacker News Points

-

Post removed?

Yes

Source URL

grafana.com/blog/2020/01/23/how-a-gcp-persistent-disk-incident-snowballed-into-a-23-hour-outage--and-taught-us-some-important-lessons

Summary

A 23-hour outage at Grafana Labs was triggered by a Google Cloud Platform incident that severely degraded the performance of solid-state Persistent Disks, causing failures in the company's Cassandra-backed Grafana Cloud Graphite service. The outage affected customers in the US-East cluster, where 20% of queries failed because data could not be retrieved from Cassandra whenever the in-memory caches did not hold it. The problem was exacerbated by Kubernetes' "OrderedReady" pod management policy, which prevented the Cassandra cluster from fully restarting. After the GCP issue was resolved, Metrictank instances could not reconnect to Cassandra because of IP address changes, so all instances had to be restarted. Recovery was further delayed by a bug in Metrictank's data queue handling, which was ultimately fixed by using atomics instead of locks. The incident led to code improvements that resolved longstanding issues, and it highlighted the effectiveness of Grafana Labs' monitoring and global team collaboration. As a result, the team expects future recoveries to be quicker and more efficient.

Trends Found in this Post

Trend	Post Mentions	Total Month Mentions	Posts	Companies	MoM
Kubernetes	5	728	86	30	-33%

