
How we responded to a 2+ hour partial outage in Grafana Cloud

Blog post from Grafana Labs

Post Details
Company: Grafana Labs
Author: Mick Gregg
Word Count: 1,313
Language: English
Summary

On February 18, 2025, Grafana Cloud experienced a 150-minute partial outage that affected 25% of its services. A configuration change to the company's TLS policies led to the loss of load balancers, and the change was rolled out without following Grafana Labs' standard testing and deployment practices. The incident did not result in any security breaches or data leaks, but some customers were unable to access services and may have lost data. Grafana Labs emphasizes its commitment to transparency and learning from mistakes, and details how the outage was resolved by rolling back the change and recreating the affected services. Going forward, the company plans to enforce stricter deployment controls, including mandatory use of deployment waves and enhanced deletion protections, to prevent similar incidents. It has implemented a CI check that restricts deployments to one wave at a time and is improving change validation through its Crossplane Managed Resources exporter, which is now in beta testing.
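
The "one wave at a time" CI check mentioned above can be illustrated with a minimal sketch. This is not Grafana Labs' implementation. The directory convention (`waves/<wave-name>/...`) and the names `WAVES_ROOT`, `waves_touched`, and `check_single_wave` are assumptions made up for illustration. The idea is only that a change set is rejected when it modifies files belonging to more than one deployment wave.

```python
from pathlib import PurePosixPath

# Hypothetical layout: each deployment wave owns a directory such as
# "waves/wave-1/...". This convention is an assumption for illustration only.
WAVES_ROOT = "waves"


def waves_touched(changed_paths):
    """Return the set of wave names whose files appear in changed_paths."""
    touched = set()
    for raw in changed_paths:
        parts = PurePosixPath(raw).parts
        # A path counts only if it looks like waves/<name>/<file...>.
        if len(parts) >= 3 and parts[0] == WAVES_ROOT:
            touched.add(parts[1])
    return touched


def check_single_wave(changed_paths):
    """Return (ok, waves); ok is False when more than one wave is modified."""
    waves = sorted(waves_touched(changed_paths))
    return len(waves) <= 1, waves
```

In a pipeline, a check like this could be fed the output of `git diff --name-only` for the change under review, and the job would fail whenever `ok` is false. Files outside the assumed `waves/` directory are ignored by this sketch, so a real check would also need a policy for shared configuration.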