Company
Date Published
Author
Julian Maurin
Word count
2246
Language
English
Hacker News points
None

Summary

A team's database monitors produced persistent noisy alerts caused by predictable scheduled jobs, in particular a morning purge job that triggered high Database Disk IOPS alerts without any actual operational impact. The initial instinct, adjusting the alert thresholds, proved ineffective: it suppressed legitimate alerts and did not address the core issue. The team instead adopted a Service Level Objective (SLO)-based approach, which focuses on system reliability rather than static thresholds. They kept the original metric but used it to measure reliability over time, setting a 98% SLO so that predictable workload spikes fit within the allowed margin. Alerts now notify the team only when reliability actually degrades, which reduces false positives and improves observability. The move to SLOs shifted the focus from arbitrary threshold tuning to reliability outcomes, turned a previously untrustworthy metric into a valuable signal, and stopped the monitor from "crying wolf."
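
The idea of measuring the same metric as reliability over time can be illustrated with a minimal Python sketch. It is not the team's actual monitor configuration: the function and class names, the evenly spaced samples, and the evaluation window are assumptions made for illustration. Only the 98% target comes from the summary above.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SloResult:
    compliance: float        # fraction of samples at or below the IOPS threshold
    budget_remaining: float  # unspent share of the error budget (negative if overspent)
    breached: bool           # True when compliance fell below the SLO target


def evaluate_slo(
    iops_samples: Sequence[float],
    iops_threshold: float,
    target: float = 0.98,  # 98% SLO, as in the summary
) -> SloResult:
    """Score a window of IOPS samples against an SLO target.

    Hypothetical helper: assumes evenly spaced samples (for example one
    per minute) covering the whole evaluation window.
    """
    if not iops_samples:
        raise ValueError("need at least one sample")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")

    total = len(iops_samples)
    # A sample is "good" when the disk stayed at or below the threshold.
    good = sum(1 for sample in iops_samples if sample <= iops_threshold)
    bad = total - good

    compliance = good / total
    # The error budget is the number of bad samples the target tolerates.
    allowed_bad = (1.0 - target) * total
    budget_remaining = 1.0 - bad / allowed_bad

    # Alert on sustained degradation, not on any single spike.
    return SloResult(compliance, budget_remaining, compliance < target)
```

Under these assumptions, a day of one-minute samples containing a 20-minute purge spike scores 1420/1440, about 98.6%, which stays above the 98% target and raises no alert. An hour of sustained high IOPS scores 1380/1440, about 95.8%, and is flagged as a breach.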