How a production outage in Grafana Cloud's Hosted Prometheus service was caused by a bad etcd client setup

Blog post from Grafana Labs

Post Details
Company: Grafana Labs
Author: Tom Wilkie
Word Count: 706
Language: English
Summary

On March 16, Grafana Cloud's Hosted Prometheus service suffered a 12-minute partial outage in the London region, caused by a misconfigured etcd client. Data storage was delayed, but no data was lost. The trigger was the termination of an etcd leader node during a scheduled Kubernetes upgrade, which left the Cortex distributors, the components responsible for deduplicating samples, with TCP connection problems. SLO-based alerting detected the problem, and restarting the affected distributors resolved it. Follow-up work to prevent similar incidents covers better configuration of gRPC keepalive probes and enhancements to the Kubernetes upgrade process. The incident also showed how Grafana Loki enabled swift log analysis and recovery, and it underscored the importance of robust monitoring and alerting in maintaining service reliability.
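
To make the role of keepalive probes concrete, here is a minimal Python sketch of why they bound the time a client can hang on a dead connection. It is a toy model, not the Cortex or etcd client code: the `KeepaliveSettings` class, the `detection_time_s` function, and the 10 s / 5 s values are all hypothetical, loosely modeled on the usual "probe interval plus probe timeout" pair of keepalive settings. It assumes an otherwise idle connection whose peer disappears without closing the socket.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class KeepaliveSettings:
    """Hypothetical keepalive knobs (illustrative names, not a real client API)."""
    time_s: float     # idle interval between keepalive probes
    timeout_s: float  # how long to wait for a probe's ack before giving up


def detection_time_s(peer_dies_at_s: float,
                     settings: Optional[KeepaliveSettings]) -> float:
    """Return when an idle client notices that its peer silently died.

    Model assumptions: probes are sent at time_s, 2*time_s, ... and the
    first probe sent at or after the peer's death is never acknowledged,
    so the connection is closed timeout_s after that probe.
    """
    if settings is None:
        # Without probes, nothing in this model ever reveals the dead peer.
        return math.inf
    probe_index = max(1, math.ceil(peer_dies_at_s / settings.time_s))
    first_unanswered_probe_s = probe_index * settings.time_s
    return first_unanswered_probe_s + settings.timeout_s


if __name__ == "__main__":
    settings = KeepaliveSettings(time_s=10.0, timeout_s=5.0)
    for death in (0.0, 12.0, 20.0):
        detected = detection_time_s(death, settings)
        print(f"peer dies at {death:>4}s, detected at {detected}s")
    print("no keepalive:", detection_time_s(12.0, None))
```

In this model the gap between the peer's death and its detection never exceeds `time_s + timeout_s`, whereas a client with no probes waits indefinitely, which is the kind of hang that a restart clears.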