Company
Date Published
Author
Mauro Stettler
Word count
1544
Language
English
Hacker News points
None

Summary

A 23-hour outage at Grafana Labs was triggered by a Google Cloud Platform (GCP) incident that severely degraded the performance of solid-state disks (SSDs), causing failures in the Cassandra-backed Grafana Cloud Graphite service. The outage affected customers in the US-East cluster, where 20% of queries failed when the in-memory caches could not serve them and the data had to be retrieved from Cassandra. The problem was exacerbated by Kubernetes' "OrderedReady" pod management policy, which prevented the Cassandra cluster from fully restarting. After the GCP issue was resolved, Metrictank instances could not reconnect to Cassandra because its IP addresses had changed, so all instances had to be restarted. Recovery was further delayed by a bug in Metrictank's data queue handling, which was ultimately fixed by using atomics instead of locks. The incident led to code improvements and highlighted the effectiveness of Grafana Labs' monitoring and global team collaboration. The team fixed longstanding issues along the way, so future recoveries should be quicker and more efficient.