Company
Date Published
Author
Tom Wilkie
Word count
1137
Language
English
Hacker News points
None

Summary

On July 19, Grafana Cloud experienced a 30-minute outage in its Hosted Prometheus service due to issues with Kubernetes Pod Priorities, particularly when a new Cortex cluster was deployed without updated priority configurations. The incident resulted from the preemption of production Ingesters due to an oversight in setting priority levels, leading to a cascading failure of the Cortex microservices architecture. Prompt alerts allowed engineers to diagnose and resolve the issue by scaling the Kubernetes cluster to accommodate both new and existing workloads. Despite the outage, no data was lost, and customers’ Prometheus servers effectively buffered and replayed writes. Grafana Labs outlined several measures to prevent future occurrences, including refining configuration processes, requiring design documents for significant changes, and automating proxy sizing to handle overloads. The incident also provided valuable insights into the automatic recovery capabilities of Cortex and the utility of Grafana Loki for log verification.