On March 11, the Grafana Cloud Hosted Prometheus service in the US-central region suffered a two-hour outage. A new tenant sent far more data than expected and, because of a bug, the preconfigured per-tenant limits that should have contained the traffic were not enforced. The excess load overwhelmed the cluster, congesting internal paths and causing cascading failures in the authentication gateways and load balancers; the tenant itself was unaware it was transmitting so much data.

Grafana's response was to scale up the cluster, impose stricter limits on the tenant, and fix the bug, restoring service by March 12. To prevent recurrence, they reverted the change in limit handling, introduced new protective limits, improved per-tenant limit management, and enhanced cluster scaling and provisioning processes, alongside investing in incident response training and automation. These measures aim to prevent any individual customer from overwhelming a cluster and to shorten recovery times in future incidents.
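The core protection described above is admission control at the ingestion boundary: each tenant gets a cap, and writes beyond it are rejected before they can load shared components. A minimal sketch of that idea, with illustrative names only (this is not Grafana's actual limit-handling code, which lives in its Cortex-based ingestion path):

```go
package main

import "fmt"

// TenantLimiter enforces a per-tenant cap on active series.
// Hypothetical type for illustration; real systems also track
// rates, label counts, and other dimensions.
type TenantLimiter struct {
	maxSeries  map[string]int // per-tenant overrides
	active     map[string]int // series currently tracked per tenant
	defaultMax int            // cap applied when no override exists
}

func NewTenantLimiter(defaultMax int) *TenantLimiter {
	return &TenantLimiter{
		maxSeries:  make(map[string]int),
		active:     make(map[string]int),
		defaultMax: defaultMax,
	}
}

// SetLimit overrides the cap for one tenant.
func (l *TenantLimiter) SetLimit(tenant string, max int) {
	l.maxSeries[tenant] = max
}

// Admit reports whether the tenant may create one more series,
// recording it if so. Rejecting here, at the edge, is what keeps
// one tenant's burst from cascading into shared infrastructure.
func (l *TenantLimiter) Admit(tenant string) bool {
	max, ok := l.maxSeries[tenant]
	if !ok {
		max = l.defaultMax
	}
	if l.active[tenant] >= max {
		return false
	}
	l.active[tenant]++
	return true
}

func main() {
	lim := NewTenantLimiter(2)
	lim.SetLimit("big-tenant", 3)
	fmt.Println(lim.Admit("small-tenant")) // true
	fmt.Println(lim.Admit("small-tenant")) // true
	fmt.Println(lim.Admit("small-tenant")) // false: default cap of 2 reached
	fmt.Println(lim.Admit("big-tenant"))   // true: higher per-tenant cap
}
```

The outage stemmed from a bug that skipped exactly this kind of check, which is why the remediation both restored enforcement and added further protective limits behind it.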