[KubeCon Recap] How to Include Latency in SLO-Based Alerting

Post Details

Company

Grafana Labs

Date Published

Nov. 27, 2019

Author

Julie Dam

Word Count

2,925

Language

English

Hacker News Points

-

Source URL

grafana.com/blog/kubecon-recap-how-to-include-latency-in-slo-based-alerting

Summary

At KubeCon + CloudNativeCon, Björn Rabenstein of Grafana Labs explained how to bring latency into SLO-based alerting, a core practice of site reliability engineering (SRE). He reviewed the concepts of SLOs, SLIs, and SLAs and how they are used to set and manage alerting thresholds. Starting from error-rate measurement, he showed that alerts must respond to both fast and slow burns of the error budget. To achieve this, he proposed monitoring the error rate over a combination of long and short time windows, which keeps alerting balanced and responsive. He also argued for including latency in SLAs by treating slow responses as errors, so that the alerts reflect the user experience as well as service reliability. The talk covered how Grafana Labs implements these ideas, using Prometheus for monitoring and Jsonnet for configuration. Rabenstein concluded by stressing the value of simplicity in alerting design while keeping the focus on meaningful performance metrics.
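The two ideas in the summary, counting slow responses as errors and checking a long and a short window together, can be illustrated with a minimal Python sketch. This is not the Prometheus and Jsonnet implementation described in the talk. The names `Request`, `bad_ratio`, and `should_alert`, along with the SLO target, latency threshold, window lengths, and burn-rate threshold, are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    age_s: float       # seconds elapsed since the request finished
    ok: bool           # True if the request succeeded
    duration_s: float  # how long the request took to serve


def bad_ratio(requests, window_s, latency_threshold_s):
    """Fraction of requests in the window that failed OR were too slow."""
    recent = [r for r in requests if r.age_s <= window_s]
    if not recent:
        return 0.0
    bad = sum(
        1 for r in recent
        if not r.ok or r.duration_s > latency_threshold_s
    )
    return bad / len(recent)


def should_alert(
    requests,
    slo_target=0.999,          # assumed SLO: 99.9% of requests are "good"
    latency_threshold_s=1.0,   # assumed: slower than this counts as an error
    long_window_s=3600,        # assumed long window (1 hour)
    short_window_s=300,        # assumed short window (5 minutes)
    burn_rate_threshold=14.4,  # assumed: alert when budget burns this fast
):
    """Alert only if BOTH windows burn the error budget faster than allowed."""
    error_budget = 1.0 - slo_target
    long_burn = bad_ratio(requests, long_window_s, latency_threshold_s) / error_budget
    short_burn = bad_ratio(requests, short_window_s, latency_threshold_s) / error_budget
    return long_burn > burn_rate_threshold and short_burn > burn_rate_threshold


if __name__ == "__main__":
    # Every request succeeds but takes 2 seconds, so all count as errors.
    slow = [Request(age_s=a, ok=True, duration_s=2.0) for a in range(0, 3600, 10)]
    print(should_alert(slow))
```

In this sketch the long window shows that a meaningful share of the budget has been consumed, while the short window confirms that the problem is still happening, so an incident that has already ended does not keep the alert firing.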