Service Disruption Timeline for October 21st, 2016
Blog post from PagerDuty
On October 21, 2016, PagerDuty experienced an outage and subsequent service degradation caused by problems at its primary DNS provider. PagerDuty relies on DNS both to route customer traffic and for communication between its internal servers, so both were affected. Engineers quickly identified and confirmed DNS lookup failures, which disrupted service for a subset of customers. To mitigate the issue, they switched from the primary to a secondary DNS provider and manually updated internal server configurations, restoring services gradually over a few hours. After recovery, engineers addressed redundant notifications and worked through a backlog of events. PagerDuty also committed to publishing a follow-up describing the steps it will take to make its DNS infrastructure more resilient and prevent similar incidents.
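The mitigation described above, falling back from a failing primary DNS provider to a secondary one, can be illustrated with a minimal Python sketch. It is not PagerDuty's implementation. The names `resolve_with_fallback`, `LookupFailed`, and `system_resolver` are hypothetical and chosen for illustration, and each provider is assumed to be modelled as a plain function that takes a hostname and returns a list of addresses.

```python
import socket
from typing import Callable, List, Sequence, Tuple

# Assumption: a resolver is any callable mapping a hostname to a list of IPs.
Resolver = Callable[[str], List[str]]


class LookupFailed(Exception):
    """Raised when every configured resolver fails for a hostname."""


def system_resolver(hostname: str) -> List[str]:
    """Resolve a hostname through the operating system's configured resolver."""
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def resolve_with_fallback(
    hostname: str, resolvers: Sequence[Tuple[str, Resolver]]
) -> Tuple[str, List[str]]:
    """Try each (provider_name, resolver) pair in order.

    Returns the name of the first provider that answered and its addresses.
    Raises LookupFailed if none of them produced an answer.
    """
    errors = []
    for name, resolver in resolvers:
        try:
            addresses = resolver(hostname)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            errors.append(f"{name}: {exc}")
            continue
        if addresses:
            return name, addresses
        errors.append(f"{name}: empty answer")
    raise LookupFailed(f"all resolvers failed for {hostname}: " + "; ".join(errors))
```

A caller would pass the providers in priority order, for example `resolve_with_fallback("example.com", [("primary", primary_fn), ("secondary", system_resolver)])`, where `primary_fn` is a placeholder for whatever queries the primary provider. Injecting resolvers as functions keeps the failover logic testable without network access.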