Service Disruption Root Cause Analysis and Follow-up Actions from October 21st, 2016

Blog post from PagerDuty

Post Details
Author: Tim Armandpour
Word Count: 847
Language: English
Summary

PagerDuty is responding to its October 21st, 2016 service disruption by addressing two primary issues: its failover approach when DNS problems occur and the quality of its monitoring of the end-to-end customer experience. On the DNS side, the company plans to redesign its DNS architecture around a multi-master approach that uses multiple DNS providers, audit DNS TTLs for consistency across its website, APIs, and mobile applications, and develop a runbook for DNS cache flushing. On the monitoring and response side, PagerDuty aims to enhance real user monitoring with a global perspective, improve how resolution steps are prioritized during disruptions so that critical services come first, and refine its multi-team response process so that on-call teams can solve problems effectively. These actions are part of PagerDuty's commitment to improving the reliability and availability of its services to meet customer expectations.
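
To make the TTL audit concrete: resolvers cache a DNS record until its TTL expires, so a hostname with a much longer TTL than its neighbors will be slower to follow a failover to another provider. The following is a minimal sketch of such a consistency check, not PagerDuty's actual tooling. The hostnames, the 300-second ceiling, and the function names `audit_ttls` and `ttl_spread` are all assumptions chosen for illustration, and the TTL values are supplied by hand rather than queried from live DNS.

```python
from typing import Dict, List, Tuple


def audit_ttls(observed: Dict[str, int], max_ttl: int) -> List[Tuple[str, int]]:
    """Return (hostname, ttl) pairs whose TTL exceeds max_ttl, sorted by hostname."""
    return sorted((host, ttl) for host, ttl in observed.items() if ttl > max_ttl)


def ttl_spread(observed: Dict[str, int]) -> int:
    """Return the gap in seconds between the longest and shortest TTL."""
    if not observed:
        return 0
    return max(observed.values()) - min(observed.values())


# Hypothetical hostnames and TTLs (in seconds); in practice these would
# come from querying each record at each DNS provider.
observed = {
    "www.example.com": 300,
    "api.example.com": 3600,
    "mobile.example.com": 60,
}

# Assumed policy for this sketch: no record may be cached longer than 300 s.
for host, ttl in audit_ttls(observed, max_ttl=300):
    print(f"{host}: TTL {ttl}s exceeds 300s")
```

Taking the observed TTLs as plain input keeps the check independent of any particular resolver library; a real audit would also need to compare the values returned by each provider for the same record.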