
Customers over control: how we measure On-call reliability

Blog post from Incident.io

Post Details
Company: Incident.io
Author: Mike Fisher
Word Count: 2,164
Language: English
Hacker News Points: -
Summary

In this post, Mike Fisher explains how incident.io measures the reliability of its On-call product by what customers experience rather than by what the company directly controls. The post centers on the product's two critical functions: alert ingestion and notification delivery. incident.io tracks a Service Level Indicator (SLI) for each, namely alert ingestion availability and notification delivery latency, and targets a monthly Service Level Objective (SLO) of 99.99% for both. Fisher describes how the systems are designed to account for third-party dependencies and user-configured delays, so that notifications stay timely and reliable even in complex scenarios. The post rejects the idea that a failure can be excused because its cause lay outside the company's direct control, and argues instead for a proactive approach that treats customer outcomes as paramount. By embracing that complexity and designing redundancy into both its own systems and those of its customers, incident.io aims to deliver a reliable customer experience.
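
To make the 99.99% figure concrete, here is a minimal Python sketch of how an availability SLI, a latency SLI, and the remaining error budget could be computed. It is an illustration under stated assumptions, not incident.io's implementation: the names `SliWindow`, `latency_sli`, and `allowed_downtime_minutes`, the 30-day month, and the latency threshold are all hypothetical, and the summary does not state what latency threshold the post uses. Latency is assumed to be measured from the intended send time, so a user-configured delay does not count against the SLI.

```python
from dataclasses import dataclass

SLO_TARGET = 0.9999  # 99.99% monthly objective, as stated in the summary


@dataclass
class SliWindow:
    """Counts of good and total events in one measurement window (hypothetical type)."""
    good_events: int
    total_events: int

    @property
    def sli(self) -> float:
        # With no traffic there is nothing to fail, so report a perfect SLI.
        if self.total_events == 0:
            return 1.0
        return self.good_events / self.total_events

    def error_budget_remaining(self, target: float = SLO_TARGET) -> float:
        """Fraction of the error budget still unspent (negative if overspent)."""
        allowed_bad = (1 - target) * self.total_events
        actual_bad = self.total_events - self.good_events
        if allowed_bad == 0:
            return 1.0 if actual_bad == 0 else 0.0
        return 1 - actual_bad / allowed_bad


def latency_sli(latencies_seconds, threshold_seconds) -> SliWindow:
    """Latency SLI: a notification is 'good' if it was delivered within the threshold.

    Assumption: each latency is (delivered_at - intended_send_at), so any
    user-configured delay is already excluded. The threshold is hypothetical.
    """
    good = sum(1 for s in latencies_seconds if s <= threshold_seconds)
    return SliWindow(good_events=good, total_events=len(latencies_seconds))


def allowed_downtime_minutes(target: float = SLO_TARGET, days: int = 30) -> float:
    """Time-based view of the same budget for a month of `days` days."""
    return (1 - target) * days * 24 * 60


if __name__ == "__main__":
    ingestion = SliWindow(good_events=999_950, total_events=1_000_000)
    print(f"ingestion SLI: {ingestion.sli:.5%}")
    print(f"budget remaining: {ingestion.error_budget_remaining():.1%}")
    print(f"allowed downtime: {allowed_downtime_minutes():.2f} min / 30 days")
```

Under these assumptions, a 99.99% objective leaves roughly 4.3 minutes of total failure per 30-day month, which is why failures caused by third-party dependencies cannot simply be excluded from the measurement without hiding most of what customers actually experience.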