How a Production Outage Was Caused Using Kubernetes Pod Priorities

Post Details

Company

Grafana Labs

Date Published

July 24, 2019

Author

Tom Wilkie

Word Count

1,137

Company Posts That Month

21

Language

English

Hacker News Points

-

Post removed?

No

Source URL

grafana.com/blog/how-a-production-outage-was-caused-using-kubernetes-pod-priorities

Summary

On July 19, Grafana Cloud's Hosted Prometheus service suffered a 30-minute outage caused by a misconfiguration of Kubernetes Pod Priorities: a new Cortex cluster was deployed without updated priority configurations. Because of this oversight in setting priority levels, the production Ingesters were preempted, which triggered a cascading failure across the Cortex microservices architecture. Prompt alerts allowed engineers to diagnose the problem and resolve it by scaling up the Kubernetes cluster so that it could accommodate both the new and the existing workloads. No data was lost, because customers’ Prometheus servers buffered writes during the outage and replayed them afterwards. Grafana Labs outlined several measures to prevent a recurrence, including refining configuration processes, requiring design documents for significant changes, and automating proxy sizing to handle overloads. The incident also demonstrated Cortex's ability to recover automatically and the usefulness of Grafana Loki for log verification.
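
The preemption mechanism behind the outage can be illustrated with a small toy model. The sketch below is not the Kubernetes scheduler's actual algorithm and does not reproduce Grafana Labs' configuration; the class names, pod names, priority values, and CPU figures are all hypothetical. It only shows the general rule that a pending pod may evict running pods of strictly lower priority when a node lacks free capacity.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    name: str
    priority: int  # higher value wins; values here are made up
    cpu: int       # requested CPU units (hypothetical)


@dataclass
class Node:
    capacity: int
    pods: list = field(default_factory=list)

    def free(self) -> int:
        return self.capacity - sum(p.cpu for p in self.pods)


def schedule(node: Node, pod: Pod):
    """Place `pod` on `node`, evicting lower-priority pods if needed.

    Returns the list of evicted pods (possibly empty), or None when the
    pod cannot fit even after evicting every lower-priority pod.
    """
    # Only pods with strictly lower priority are eviction candidates;
    # try the lowest-priority ones first.
    candidates = sorted(
        (p for p in node.pods if p.priority < pod.priority),
        key=lambda p: p.priority,
    )
    free = node.free()
    victims = []
    for victim in candidates:
        if free >= pod.cpu:
            break
        victims.append(victim)
        free += victim.cpu
    if free < pod.cpu:
        return None  # nothing is evicted when the pod still would not fit
    for victim in victims:
        node.pods.remove(victim)
    node.pods.append(pod)
    return victims


# A full node running an existing workload with a low (hypothetical) priority.
node = Node(capacity=4, pods=[Pod("existing-ingester", priority=0, cpu=4)])
evicted = schedule(node, Pod("new-ingester", priority=1000, cpu=4))
print([p.name for p in evicted])
```

In this model, giving the node enough capacity for both pods (for example `capacity=8`) makes `schedule` return an empty list, which mirrors the fix described in the summary: scaling the cluster so new and existing workloads both fit.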

Trends Found in this Post

Trend	Post Mentions	Total Month Mentions	Posts	Companies	MoM
Kubernetes	8	794	81	33	+103%
