Outage Post-Mortem – March 25, 2014
Blog post from PagerDuty
On March 25th, PagerDuty experienced a significant service degradation lasting three hours, delaying 11% of notifications and failing to accept 2.5% of event attempts. The issue stemmed from an overload in their Cassandra-based notifications pipeline, exacerbated by both steady-state and bursty workloads from scheduled jobs. Although the system's retry logic was designed to handle transient failures, it inadvertently prolonged the overload period.

To address these issues, PagerDuty plans to temporally distribute and flatten scheduled job loads, isolate systems onto separate Cassandra clusters to prevent cross-system interference, and adjust failure detection and retry policies to better handle overloads. They are committed to enhancing reliability and will incorporate overload scenarios into their failure testing regime.
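The post does not show PagerDuty's actual retry code, but the failure mode it describes, retries piling onto an already overloaded backend, is commonly mitigated with capped exponential backoff plus jitter. The sketch below is a hypothetical illustration of that technique (the function name and parameters are assumptions, not PagerDuty's implementation); the jitter also spreads retry traffic over time, the same idea behind flattening scheduled job loads.

```python
import random

def backoff_delay(attempt, base=0.5, cap=30.0):
    """Hypothetical retry-delay helper: the upper bound grows
    exponentially with each attempt but is capped, and full jitter
    randomizes the delay so retrying clients desynchronize instead
    of hammering a recovering backend in lockstep."""
    upper = min(cap, base * (2 ** attempt))
    return random.uniform(0, upper)

# Each delay stays within its (capped) exponential envelope.
for attempt in range(8):
    d = backoff_delay(attempt)
    assert 0 <= d <= min(30.0, 0.5 * (2 ** attempt))
```

Without the jitter, every client that failed at the same moment would retry at the same moment, recreating the burst that caused the overload.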