Outage Post-Mortem – June 3rd & 4th, 2014
Blog post from PagerDuty
PagerDuty experienced two significant SEV-1 outages, on June 3rd and 4th, that impacted its Notification Pipeline; both were traced to problems with its Cassandra NoSQL datastore. The first outage, on June 3rd, caused delayed notifications and degraded performance, while the more severe outage on June 4th left a substantial portion of events and notifications delayed or undelivered. The root cause was Cassandra's background (anti-entropy) repair process, which, combined with an already heavy workload, overloaded the cluster and destabilized it. Initial attempts to stabilize the system, stopping the repair and shedding load, proved insufficient, and the team ultimately resorted to a drastic "factory reset" of the cluster to regain control.

PagerDuty acknowledged that the cluster was under-scaled and shared among several services with differing load patterns, both of which contributed to the failure. Remediation plans include scaling up the Cassandra nodes, splitting the workload across multiple clusters, and bringing in additional Cassandra expertise. The post also concedes that some of these improvements had been planned earlier but were deferred because work was prioritized for efficiency.
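Since the post pins the instability on repair running on top of peak traffic, one mitigation it implies is to gate repairs on node load. The following is a minimal sketch of that idea, not PagerDuty's actual tooling; the keyspace name, load threshold, and polling interval are illustrative assumptions. It uses the standard nodetool repair invocation with -pr (primary ranges only), so a rolling repair across the ring touches each token range once rather than replication-factor times.

    #!/usr/bin/env python3
    """Hypothetical sketch: run Cassandra's anti-entropy repair only when
    the node is lightly loaded, so repair I/O does not stack on top of
    peak traffic. Thresholds and the keyspace name are assumptions."""

    import os
    import subprocess
    import time

    KEYSPACE = "notifications"   # hypothetical keyspace name
    MAX_LOAD_AVG = 4.0           # illustrative 1-minute load-average ceiling
    CHECK_INTERVAL_SECS = 300    # wait this long before re-checking load


    def node_is_quiet() -> bool:
        """True if the 1-minute load average is below our ceiling."""
        one_min_load, _, _ = os.getloadavg()
        return one_min_load < MAX_LOAD_AVG


    def run_primary_range_repair() -> None:
        """Repair only this node's primary token ranges (-pr)."""
        subprocess.run(
            ["nodetool", "repair", "-pr", KEYSPACE],
            check=True,
        )


    if __name__ == "__main__":
        # Wait for a quiet window, then kick off the repair.
        while not node_is_quiet():
            time.sleep(CHECK_INTERVAL_SECS)
        run_primary_range_repair()

A scheme like this trades repair timeliness for stability: on a persistently busy node the repair may be postponed past the gc_grace window, so in practice it would need an upper bound on how long it is willing to wait.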