Outage Post Mortem – April 14th, 2014

Post Details

Company: PagerDuty
Date Published: April 28, 2014
Author: Tony Albanese
Word Count: 352
Company Posts That Month: 11
Language: English
Hacker News Points: -
Post Removed: No
Source URL: www.pagerduty.com/blog/uncategorized/outage-post-mortem-april-14th-2014

Summary

On April 14th, 2014, PagerDuty experienced a 30-minute outage that affected both its mobile and web applications, delaying alerts and causing account management issues for customers. The outage began with an increased workload on the event processing system, which degraded performance and caused timeouts in an upstream system. Because that upstream system had a retry policy, the timed-out requests were retried, which added significant load and led to availability issues. Despite the delays, no events were lost, and all alerts were eventually sent. PagerDuty's operations and engineering teams alleviated the problem by removing duplicate queued events and adjusting the retry policy to prevent a recurrence. Long-term fixes include rebalancing timeout and retry policies and separating event processing from customer-facing applications to improve reliability and performance. The company apologized for the service disruption and committed to preventing similar issues in the future.
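
The two mitigations named in the summary, a bounded retry policy and removal of duplicate queued events, can be illustrated with a minimal Python sketch. It is not PagerDuty's implementation, which the post does not describe. Every name here (backoff_delays, call_with_retries, dedupe_events), the parameter defaults, and the use of exponential backoff with jitter are assumptions chosen for illustration.

```python
import random
import time
from typing import Callable, Hashable, Iterable, List, TypeVar

T = TypeVar("T")
E = TypeVar("E")


def backoff_delays(max_attempts: int, base_delay: float, max_delay: float) -> List[float]:
    """Upper bound of the wait before each retry (one entry per retry).

    The bound doubles after every failed attempt and is capped at max_delay.
    """
    return [min(max_delay, base_delay * 2 ** i) for i in range(max_attempts - 1)]


def call_with_retries(
    send: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.1,
    max_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Call send(), retrying on TimeoutError at most max_attempts times in total.

    Hypothetical policy: a hard cap on attempts plus randomized (jittered)
    exponential backoff, so that a slow downstream system is not hit with
    an immediate wave of retries.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    caps = backoff_delays(max_attempts, base_delay, max_delay)
    for attempt in range(max_attempts):
        try:
            return send()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted: surface the timeout
            # Sleep a random fraction of the current cap before retrying.
            sleep(rng() * caps[attempt])
    raise AssertionError("unreachable")


def dedupe_events(events: Iterable[E], key: Callable[[E], Hashable]) -> List[E]:
    """Keep the first event seen for each key, preserving queue order."""
    seen = set()
    unique = []
    for event in events:
        k = key(event)
        if k not in seen:
            seen.add(k)
            unique.append(event)
    return unique
```

In this sketch the cap on attempts limits how much extra load one slow request can generate, and the jitter spreads retries out in time. Deduplicating by a stable event key assumes that retried submissions carry the same identifier as the original, which the source does not confirm.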
