From Incident Counting to SLIs: How DigitalOcean Rethought Availability
Blog post from DigitalOcean
To bring its internal metrics closer to what customers actually experience, DigitalOcean overhauled how it measures availability, moving from an incident-based approach to one built on Service Level Indicators (SLIs). The previous metric calculated availability from incident duration and often counted partial degradations as total outages, so it did not reflect customer experience accurately. The company therefore adopted a two-part measurement system that separates the Control Plane, which handles orchestration and API calls, from the Data Plane, which covers live product instances. Each plane uses its own methodology, tailored to the kinds of failures it sees, which allows more precise and meaningful assessments of availability. Because the split mirrors the control and data plane distinction used by other cloud providers, the new framework also makes comparisons with industry standards easier. Finally, the new system weights failures by traffic volume, so that a failure in a given region counts in proportion to the traffic that region carries.
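To make the contrast concrete, here is a minimal Python sketch of the two ideas: an availability figure derived from incident duration, and a request-based SLI aggregated across regions with traffic weighting. It is an illustration only, not DigitalOcean's implementation. The class and function names, the region labels, and all numbers are hypothetical, and the post does not spell out the exact formulas used for either plane.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RegionWindow:
    """Request counts for one region over a measurement window (hypothetical shape)."""
    region: str
    total_requests: int
    failed_requests: int


def incident_based_availability(incident_minutes: float, window_minutes: float) -> float:
    """Old-style metric: every minute of an incident counts as full downtime."""
    return 1.0 - incident_minutes / window_minutes


def traffic_weighted_availability(windows: Sequence[RegionWindow]) -> float:
    """Request-based SLI: each region's success ratio, weighted by its share of traffic."""
    total = sum(w.total_requests for w in windows)
    if total == 0:
        return 1.0  # assumption: no traffic means nothing failed
    weighted = 0.0
    for w in windows:
        if w.total_requests == 0:
            continue  # a region with no traffic contributes no weight
        success_ratio = 1.0 - w.failed_requests / w.total_requests
        weighted += success_ratio * (w.total_requests / total)
    return weighted


# Hypothetical example: a busy region with few failures, a quiet region with many.
windows = [
    RegionWindow("region-a", total_requests=900, failed_requests=9),
    RegionWindow("region-b", total_requests=100, failed_requests=50),
]
weighted = traffic_weighted_availability(windows)
unweighted = sum(1.0 - w.failed_requests / w.total_requests for w in windows) / len(windows)
```

In this made-up example the unweighted average of the two regions is 0.745, while the traffic-weighted figure is about 0.941, because the quiet region's failures affect far fewer requests. Weighting by request share is arithmetically the same as dividing all successful requests by all requests, which is one reason a request-based SLI tracks customer impact more closely than incident minutes do.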