Company
PagerDuty
Date Published
Author
Ryan Duffield
Word count
433
Language
English
Hacker News points
None

Summary

PagerDuty's realtime engineering team experienced a brief outage on May 30, 2013, when network latency at the Linode Fremont datacenter degraded alerting reliability. The latency triggered backup worker processes intended to drain the notification queues, but a bug in their error handling delayed 7% of outgoing alerts; all notifications were eventually delivered. The engineering team identified and fixed the bug and acknowledged the need for more robust testing of exceptional scenarios such as datacenter failures. To prevent recurrences, PagerDuty plans to run regular controlled failure tests called "Failure Friday" and to build a Chaos Monkey-like system that continuously injects random failures, hardening its systems' resilience.
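
The Chaos Monkey-style plan mentioned above can be illustrated with a minimal sketch. The script below is purely hypothetical: the host list, the `notification-worker` service name, and the hourly interval are assumptions for illustration, not PagerDuty's actual tooling. It shows only the basic loop of picking a redundant component at random and disabling it, on the premise that the remaining workers should keep delivering alerts.

```python
# Hypothetical sketch of a Chaos Monkey-style failure injector.
# Hosts, service names, and intervals are illustrative assumptions only.
import random
import subprocess
import time

# Hosts assumed to run redundant notification workers.
TARGET_HOSTS = [
    "worker-1.example.com",
    "worker-2.example.com",
    "worker-3.example.com",
]

def kill_random_worker() -> str:
    """Pick one host at random and stop its notification worker service."""
    host = random.choice(TARGET_HOSTS)
    print(f"Injecting failure: stopping notification worker on {host}")
    # Assumes SSH access and a systemd-managed 'notification-worker' service.
    subprocess.run(
        ["ssh", host, "sudo", "systemctl", "stop", "notification-worker"],
        check=False,
    )
    return host

if __name__ == "__main__":
    # Inject one failure per hour; alerting should still succeed because
    # the surviving workers are expected to pick up the queue.
    while True:
        kill_random_worker()
        time.sleep(3600)
```

In a real "Failure Friday" exercise this kind of injection would be run under supervision, with the team watching queues and alert delivery to confirm the failover behaves as intended before restoring the stopped component.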