
Outage Post-Mortem – March 25, 2014

Blog post from PagerDuty

Post Details

Company: PagerDuty
Date Published:
Author: Paul Rechsteiner
Word Count: 590
Language: English
Hacker News Points: -
Summary

On March 25th, 2014, PagerDuty experienced a three-hour service degradation during which 11% of notifications were delayed and 2.5% of inbound event submissions were not accepted. The root cause was an overload of the Cassandra cluster backing the notifications pipeline, driven by a combination of steady-state traffic and bursty load from scheduled jobs. Although the system's retry logic was designed to ride out transient failures, under sustained overload it added traffic and prolonged the incident. To address this, PagerDuty plans to spread scheduled jobs out over time to flatten their load, isolate systems onto separate Cassandra clusters to prevent cross-system interference, and tune failure detection and retry policies to cope better with overload. The team is committed to improving reliability and will add overload scenarios to its failure-testing regime.
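The summary notes that retry logic intended to absorb transient failures can instead prolong an overload: every failed call is immediately retried, multiplying traffic against an already struggling backend. A common mitigation is capped exponential backoff with jitter and a bounded attempt count. The sketch below is illustrative only (it is not PagerDuty's implementation; the function and parameter names are hypothetical):

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Retry a failing operation with capped exponential backoff and full jitter.

    Immediate, unbounded retries can turn a brief overload into a sustained
    one. Backing off exponentially, adding jitter so clients do not retry in
    synchronized waves, and capping attempts sheds load instead of adding it.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up; let the caller queue, drop, or alert
            # Full jitter: sleep a random duration in [0, capped backoff].
            backoff = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, backoff))


# Usage: an operation that fails twice under load, then succeeds.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("overloaded")
    return "ok"

print(retry_with_backoff(flaky, base_delay=0.01))  # -> ok
```

The same jitter idea applies to the scheduled-job fix mentioned above: randomizing each job's start time within a window flattens the burst the jobs would otherwise produce when they all fire at once.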