Outage Post-Mortem – June 3rd & 4th, 2014
Blog post from PagerDuty
PagerDuty experienced two significant SEV-1 outages, on June 3rd and 4th, that impacted its Notification Pipeline; both were traced to problems with its Cassandra NoSQL datastore. The first outage, on June 3rd, caused delayed notifications and degraded performance, while the more severe outage on June 4th left a substantial portion of events and notifications delayed or undelivered. The root cause was Cassandra's background (anti-entropy) repair process, which, combined with an already heavy workload, overloaded the cluster and destabilized it. Initial attempts to stabilize the system, stopping the repair and shedding load, proved insufficient, and the team ultimately resorted to a drastic "factory reset" of the cluster to regain control.

PagerDuty acknowledged that the cluster was under-scaled and shared among several services with differing load patterns, both of which contributed to the failure. Remediation plans include scaling up the Cassandra nodes, splitting the workload across multiple clusters, and bringing in additional Cassandra expertise. The post also concedes that some of these improvements had been planned earlier but were deferred because work was prioritized for efficiency.
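Since the post pins the instability on repair running on top of peak traffic, one mitigation it implies is to gate repairs on node load. The following is a minimal sketch of that idea, not PagerDuty's actual tooling; the keyspace name, load threshold, and polling interval are illustrative assumptions. It uses the standard nodetool repair invocation with -pr (primary ranges only), so a rolling repair across the ring touches each token range once rather than replication-factor times.

    #!/usr/bin/env python3
    """Hypothetical sketch: run Cassandra's anti-entropy repair only when
    the node is lightly loaded, so repair I/O does not stack on top of
    peak traffic. Thresholds and the keyspace name are assumptions."""

    import os
    import subprocess
    import time

    KEYSPACE = "notifications"   # hypothetical keyspace name
    MAX_LOAD_AVG = 4.0           # illustrative 1-minute load-average ceiling
    CHECK_INTERVAL_SECS = 300    # wait this long before re-checking load


    def node_is_quiet() -> bool:
        """True if the 1-minute load average is below our ceiling."""
        one_min_load, _, _ = os.getloadavg()
        return one_min_load < MAX_LOAD_AVG


    def run_primary_range_repair() -> None:
        """Repair only this node's primary token ranges (-pr)."""
        subprocess.run(
            ["nodetool", "repair", "-pr", KEYSPACE],
            check=True,
        )


    if __name__ == "__main__":
        # Wait for a quiet window, then kick off the repair.
        while not node_is_quiet():
            time.sleep(CHECK_INTERVAL_SECS)
        run_primary_range_repair()

A scheme like this trades repair timeliness for stability: on a persistently busy node the repair may be postponed past the gc_grace window, so in practice it would need an upper bound on how long it is willing to wait.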