Company
Date Published
Author
Callum Styan
Word count
2081
Language
English
Hacker News points
None

Summary

Callum Styan of Grafana Labs describes a service outage in Grafana Cloud Logs caused by incorrect use of Kubernetes label selectors during a migration. The incident occurred while the team was moving Grafana Cloud Logs clusters from a separate key-value store to embedded memberlist in order to eliminate a single point of failure. An additional label selector was mistakenly added, which misdirected service discovery and overwhelmed a subset of the pods, resulting in a 25-minute outage. The outage was resolved by horizontally scaling the distributor deployments, which spread the load more evenly, and by reverting the configuration change. The investigation found that the unintended label selectors on the services filtered pods incorrectly, and that the effect was compounded by the different scaling configurations of each environment. The team relied on extensive metrics and logging to identify the problem and has since updated its configurations to prevent similar issues. The article is intended as a learning resource for others who use Kubernetes label selectors.
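
To illustrate the failure mode in general terms, the following is a minimal Python sketch of how an equality-based Kubernetes label selector behaves: a pod is selected only if it carries every key/value pair in the selector, so adding one more pair can silently shrink the set of matched pods. The pod names, the `example-group` label key, and the request figures are all hypothetical and are not taken from the incident; the sketch models the matching rule only and does not call any Kubernetes API.

```python
from typing import Dict, List

Labels = Dict[str, str]


def matches(selector: Labels, pod_labels: Labels) -> bool:
    """Equality-based matching: every selector pair must be present on the pod."""
    return all(pod_labels.get(key) == value for key, value in selector.items())


def select_pods(selector: Labels, pods: Dict[str, Labels]) -> List[str]:
    """Return the sorted names of the pods whose labels satisfy the selector."""
    return sorted(name for name, labels in pods.items() if matches(selector, labels))


def requests_per_pod(total_requests: float, endpoints: List[str]) -> float:
    """Evenly divide traffic across the selected endpoints."""
    return total_requests / len(endpoints)


# Hypothetical pods; the label keys and values are illustrative assumptions.
PODS: Dict[str, Labels] = {
    "distributor-0": {"name": "distributor", "example-group": "a"},
    "distributor-1": {"name": "distributor", "example-group": "a"},
    "distributor-2": {"name": "distributor", "example-group": "b"},
    "distributor-3": {"name": "distributor", "example-group": "b"},
}

# Intended selector: matches every distributor pod.
intended = select_pods({"name": "distributor"}, PODS)

# Selector with one extra pair: pairs are ANDed, so fewer pods match.
narrowed = select_pods({"name": "distributor", "example-group": "a"}, PODS)

if __name__ == "__main__":
    total = 1000.0  # hypothetical requests per second
    print("intended:", intended, requests_per_pod(total, intended))
    print("narrowed:", narrowed, requests_per_pod(total, narrowed))
```

In this toy setup the extra selector pair halves the number of endpoints, so each remaining pod receives twice the traffic. That is the general shape of the overload described above, and it also shows why adding replicas that match the narrowed selector would relieve the pressure.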