Alerting on the User Experience
Blog post from Honeycomb
Deciding who should be on call for alerts that span systems owned by different teams is a complex socio-technical challenge, because effective Service Level Objectives (SLOs) often measure interactions between multiple services. The blog post explores several strategies for handling such alerts: not paging anyone at night, paging the on-call engineer for every alert, having a designated team investigate first before involving others, and using a "switchboard" pattern in which a rotation of Site Reliability Engineers (SREs) manages alerts. Automation can help route alerts by synthesizing signals, but fallbacks must be defined carefully, because novel failures may still require human intervention. The post emphasizes aligning SLOs with user experience so that alerts are meaningful. It also reflects on Honeycomb's internal practices, such as end-to-end (E2E) monitoring to ensure that data is accessible, to illustrate how the switchboard pattern addresses complex service interactions. There is no one-size-fits-all solution: an effective alerting strategy depends on understanding team dynamics and organizational culture. Honeycomb's approach is presented as one potential model for improving the on-call engineer's experience.
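The idea of automated routing with a carefully defined human fallback can be pictured with a minimal Python sketch. Everything here is hypothetical and invented for illustration: the `Alert` type, the `route` function, and the service and team names. The post does not describe Honeycomb's actual implementation. The sketch assumes an alert arrives with a list of services whose own signals look unhealthy, and pages a single owning team only when those signals point unambiguously at it.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Hypothetical ownership map and fallback rotation (assumptions, not from the post).
OWNERS: Dict[str, str] = {"ingest": "team-ingest", "query": "team-query"}
SWITCHBOARD = "sre-switchboard"


@dataclass(frozen=True)
class Alert:
    slo: str                            # name of the user-facing SLO that is burning
    failing_services: Tuple[str, ...]   # services whose own signals look unhealthy


def route(alert: Alert,
          owners: Optional[Dict[str, str]] = None,
          fallback: str = SWITCHBOARD) -> str:
    """Return the rotation to page for this alert.

    Page a team directly only when every implicated service is owned
    and all of them belong to the same team. Anything else (no signal,
    an unowned service, or several teams implicated) goes to the
    switchboard rotation so a human can triage.
    """
    owners = OWNERS if owners is None else owners
    if not alert.failing_services:
        return fallback  # novel failure: the SLO burns but no service stands out
    teams = set()
    for service in alert.failing_services:
        team = owners.get(service)
        if team is None:
            return fallback  # unknown owner: do not guess
        teams.add(team)
    return teams.pop() if len(teams) == 1 else fallback
```

In this sketch the fallback is the default outcome and direct routing is the exception, which reflects the post's caution that novel failures may still need a person to decide who gets involved.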