Company
Date Published
Author
Jorge Creixell and Manoj Acharya
Word count
1832
Language
English
Hacker News points
None

Summary

Grafana Labs has developed an anomaly detection framework using PromQL to enhance incident investigation by providing crucial context quickly, which is essential during time-sensitive scenarios like on-call alerts. The framework, designed to work seamlessly with Prometheus-compatible systems, operates without external dependencies and is scalable to handle large metric data volumes. Initially based on the z-score formula for anomaly detection, the framework uses Prometheus recording rules to establish baselines and detect anomalies by setting upper and lower behavior bands. Challenges such as extreme outliers, low sensitivity, and discontinuities were addressed by introducing smoothing functions, filtering low variability periods, and defining minimum margins. The framework also accounts for long-term recurrent patterns by predicting behavior based on past data. Users can implement the framework by adding specific recording and alerting rules to their Prometheus instance, and Grafana Labs encourages feedback for future enhancements. The framework's effectiveness is demonstrated within Grafana Cloud, where it integrates with SLO-based alerts to provide actionable insights and accelerate root-cause analysis.