How a production outage in Grafana Cloud's Hosted Prometheus service was caused by a bad etcd client setup

Post Details

Company

Grafana Labs

Date Published

April 7, 2020

Author

Tom Wilkie

Word Count

706

Company Posts That Month

21

Language

English

Hacker News Points

-

Source URL

grafana.com/blog/how-a-production-outage-in-grafana-clouds-hosted-prometheus-service-was-caused-by-a-bad-etcd-client-setup

Summary

On March 16, Grafana Cloud's Hosted Prometheus service suffered a 12-minute partial outage in the London region, caused by a misconfigured etcd client. The outage delayed data storage but did not result in data loss. It was triggered when an etcd leader node was terminated during a scheduled Kubernetes upgrade, which caused TCP connection issues for the Cortex distributor responsible for deduplicating samples. The problem was detected through SLO-based alerting and resolved by restarting the affected distributors. The measures to prevent similar incidents are better configuration of gRPC keepalive probes and improvements to the Kubernetes upgrade process. The incident also showed how Grafana Loki enabled swift log analysis and recovery, and it underscored the importance of robust monitoring and alerting in maintaining service reliability.
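
To illustrate why keepalive probes matter in an incident like this, here is a minimal Python sketch of the general idea. It is not Cortex, etcd, or gRPC code. `KeepalivePolicy`, `detection_deadline`, and `connection_state` are hypothetical names, and the 10-second interval and 5-second timeout are assumed values chosen only for illustration. The sketch shows that a client with probes enabled declares a silent peer dead within a bounded time, while a client without them keeps assuming the connection is healthy.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class KeepalivePolicy:
    """Hypothetical stand-in for a client's keepalive settings (not a real etcd or gRPC type)."""
    interval_s: float  # idle time after which the client sends a probe
    timeout_s: float   # how long the client waits for the probe to be acknowledged


def detection_deadline(policy: KeepalivePolicy, last_activity_s: float) -> float:
    """Return the time at which a peer that went silent is declared dead."""
    return last_activity_s + policy.interval_s + policy.timeout_s


def connection_state(policy: Optional[KeepalivePolicy],
                     last_activity_s: float,
                     now_s: float) -> str:
    """Classify a connection on which nothing has been received since last_activity_s."""
    if policy is None:
        # No probes configured: silence alone never marks the peer as dead.
        return "assumed-alive"
    idle = now_s - last_activity_s
    if idle < policy.interval_s:
        return "alive"      # still inside the normal idle window
    if idle < policy.interval_s + policy.timeout_s:
        return "probing"    # probe sent, acknowledgement not yet overdue
    return "dead"           # probe went unanswered: close and reconnect


if __name__ == "__main__":
    policy = KeepalivePolicy(interval_s=10.0, timeout_s=5.0)  # assumed values
    for now in (3.0, 12.0, 15.0, 720.0):
        print(now, connection_state(policy, 0.0, now), connection_state(None, 0.0, now))
```

With the assumed policy, the silent peer is flagged after 15 seconds. Without a policy, the sketch still reports the connection as alive 720 seconds (12 minutes) later.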

Trends Found in this Post

Trend	Post Mentions	Total Month Mentions	Posts	Companies	MoM
Kubernetes	3	882	115	37	+3%