Outage Post-Mortem – March 25, 2014
Blog post from PagerDuty
On March 25th, PagerDuty experienced a significant service degradation lasting three hours, delaying 11% of notifications and failing to accept 2.5% of event attempts. The issue stemmed from an overload in their Cassandra-based notifications pipeline, exacerbated by both steady-state and bursty workloads from scheduled jobs. Although the system's retry logic was designed to handle transient failures, it inadvertently prolonged the overload period.

To address these issues, PagerDuty plans to temporally distribute and flatten scheduled job loads, isolate systems onto separate Cassandra clusters to prevent cross-system interference, and adjust failure detection and retry policies to better handle overloads. They are committed to enhancing reliability and will incorporate overload scenarios into their failure testing regime.
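The post does not show PagerDuty's actual retry code, but the failure mode it describes, retries piling onto an already overloaded backend, is commonly mitigated with capped exponential backoff plus jitter. The sketch below is a hypothetical illustration of that technique (the function name and parameters are assumptions, not PagerDuty's implementation); the jitter also spreads retry traffic over time, the same idea behind flattening scheduled job loads.

```python
import random

def backoff_delay(attempt, base=0.5, cap=30.0):
    """Hypothetical retry-delay helper: the upper bound grows
    exponentially with each attempt but is capped, and full jitter
    randomizes the delay so retrying clients desynchronize instead
    of hammering a recovering backend in lockstep."""
    upper = min(cap, base * (2 ** attempt))
    return random.uniform(0, upper)

# Each delay stays within its (capped) exponential envelope.
for attempt in range(8):
    d = backoff_delay(attempt)
    assert 0 <= d <= min(30.0, 0.5 * (2 ** attempt))
```

Without the jitter, every client that failed at the same moment would retry at the same moment, recreating the burst that caused the overload.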