Outage Post Mortem – April 14th, 2014

Blog post from PagerDuty

Post Details
Company: PagerDuty
Author: Tony Albanese
Word Count: 352
Language: English
Summary

On April 14th, 2014, PagerDuty experienced a 30-minute outage that affected both its mobile and web applications, delaying alerts and disrupting account management for customers. An increased workload on the event processing system degraded its performance and caused timeouts in an upstream system. Because that upstream system retried requests that timed out, the retries added significant load and made the availability problems worse. Despite the delays, no events were lost, and all alerts were eventually sent. PagerDuty's operations and engineering teams relieved the problem by removing duplicate queued events and adjusting the retry policy to prevent a recurrence. Long-term fixes include rebalancing timeout and retry policies and separating event processing from customer-facing applications to improve reliability and performance. The company apologized for the disruption and committed to preventing similar issues in the future.
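
The two immediate mitigations named above, removing duplicate queued events and bounding the retry policy, can be illustrated with a minimal Python sketch. This is not PagerDuty's code: the function names, the `id` field used to identify an event, and the default attempt count and delays are all assumptions chosen for illustration. The sketch caps the number of attempts and spaces retries with exponential backoff and jitter, so a slow downstream system is not hit with immediate repeat requests.

```python
import random
import time
from typing import Callable, Dict, Iterable, List, Optional


def backoff_delays(max_attempts: int, base: float = 0.5, cap: float = 30.0,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Return the wait before each retry (max_attempts - 1 values).

    Each wait is drawn uniformly from [0, min(cap, base * 2**i)], so the
    upper bound doubles per retry and callers do not retry in lockstep.
    """
    rng = rng or random.Random()
    return [rng.uniform(0, min(cap, base * 2 ** i))
            for i in range(max_attempts - 1)]


def send_with_retry(send: Callable[[Dict], str], event: Dict,
                    max_attempts: int = 4,
                    sleep: Callable[[float], None] = time.sleep) -> str:
    """Call send(event), retrying only on TimeoutError, at most max_attempts times."""
    delays = backoff_delays(max_attempts)
    for attempt in range(max_attempts):
        try:
            return send(event)
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # retry budget spent: surface the failure
            sleep(delays[attempt])
    raise ValueError("max_attempts must be at least 1")


def dedupe_events(events: Iterable[Dict]) -> List[Dict]:
    """Keep the first event seen for each id (hypothetical key), preserving order."""
    seen = set()
    unique = []
    for event in events:
        if event["id"] not in seen:
            seen.add(event["id"])
            unique.append(event)
    return unique
```

The attempt cap matters as much as the backoff: without one, every timeout generates more work for a system that is already overloaded, which is the feedback loop the summary describes.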