Service Disruption Timeline for October 21st, 2016

Blog post from PagerDuty

Post Details
Company: PagerDuty
Date Published:
Author: Tim Armandpour
Word Count: 524
Language: English
Hacker News Points: -
Summary

On October 21, 2016, PagerDuty experienced an outage followed by a period of degraded service, caused by problems at its primary DNS provider. PagerDuty relies on DNS both to route customer traffic and for communication between its internal servers, so both were affected. Engineers quickly identified and confirmed that DNS lookups were failing, which disrupted service for a subset of customers. To mitigate the issue, they switched from the primary to a secondary DNS provider and manually updated internal server configurations, gradually restoring services over a few hours. After recovery, engineers addressed redundant notifications and worked to clear a backlog of events. The company committed to publishing a follow-up describing the steps it will take to make its DNS infrastructure more resilient and prevent similar incidents.
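
The move from a primary to a secondary DNS provider can be illustrated with a minimal Python sketch of ordered fallback between resolvers. This is not a description of PagerDuty's actual tooling, which the post does not detail. The function name `resolve_with_fallback`, the stand-in resolvers `primary` and `secondary`, the hostname, and the returned address are all hypothetical, chosen only for illustration.

```python
import socket
from typing import Callable, Sequence

# Hypothetical type: a resolver maps a hostname to an IP address string.
Resolver = Callable[[str], str]


def resolve_with_fallback(hostname: str, resolvers: Sequence[Resolver]) -> str:
    """Try each resolver in order and return the first successful answer."""
    last_error = None
    for resolve in resolvers:
        try:
            return resolve(hostname)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            last_error = exc
    raise LookupError(f"all resolvers failed for {hostname!r}") from last_error


def primary(hostname: str) -> str:
    # Stand-in for a provider whose lookups are currently failing.
    raise socket.gaierror("primary provider unavailable")


def secondary(hostname: str) -> str:
    # Stand-in for a healthy secondary provider (192.0.2.0/24 is a documentation range).
    return "192.0.2.10"


address = resolve_with_fallback("api.example.internal", [primary, secondary])
```

In this sketch the caller falls through to the secondary only after the primary raises an error. A real deployment would more likely publish records at both providers in advance, so that clients never depend on a single one.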