Company
PagerDuty
Date Published
Author
Ryan Duffield
Word count
433
Language
English
Hacker News points
None

Summary

PagerDuty's realtime engineering team experienced a brief outage on May 30, 2013, when network latency at the Linode Fremont datacenter degraded alerting reliability. The latency triggered backup worker processes intended to drain the notification queues, but a bug in their error handling delayed 7% of outgoing alerts; all notifications were eventually delivered. The engineering team identified and fixed the bug and acknowledged the need for more robust testing of exceptional scenarios such as datacenter failures. To prevent recurrences, PagerDuty plans to run regular controlled failure tests called "Failure Friday" and to build a Chaos Monkey-like system that continuously injects random failures, hardening its systems' resilience.
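
The Chaos Monkey-style plan mentioned above can be illustrated with a minimal sketch. The script below is purely hypothetical: the host list, the `notification-worker` service name, and the hourly interval are assumptions for illustration, not PagerDuty's actual tooling. It shows only the basic loop of picking a redundant component at random and disabling it, on the premise that the remaining workers should keep delivering alerts.

```python
# Hypothetical sketch of a Chaos Monkey-style failure injector.
# Hosts, service names, and intervals are illustrative assumptions only.
import random
import subprocess
import time

# Hosts assumed to run redundant notification workers.
TARGET_HOSTS = [
    "worker-1.example.com",
    "worker-2.example.com",
    "worker-3.example.com",
]

def kill_random_worker() -> str:
    """Pick one host at random and stop its notification worker service."""
    host = random.choice(TARGET_HOSTS)
    print(f"Injecting failure: stopping notification worker on {host}")
    # Assumes SSH access and a systemd-managed 'notification-worker' service.
    subprocess.run(
        ["ssh", host, "sudo", "systemctl", "stop", "notification-worker"],
        check=False,
    )
    return host

if __name__ == "__main__":
    # Inject one failure per hour; alerting should still succeed because
    # the surviving workers are expected to pick up the queue.
    while True:
        kill_random_worker()
        time.sleep(3600)
```

In a real "Failure Friday" exercise this kind of injection would be run under supervision, with the team watching queues and alert delivery to confirm the failover behaves as intended before restoring the stopped component.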