Outage Post-Mortem – June 14

Post Details

Company: PagerDuty
Date Published: June 19, 2012
Author: Alex Solomon
Word Count: 1,196
Company Posts That Month: 4
Language: English
Hacker News Points: -
Post removed?: No
Source URL: www.pagerduty.com/blog/uncategorized/outage-post-mortem-june-14

Summary

On June 14, PagerDuty experienced a significant outage that began at 8:44 pm Pacific time and caused 30 minutes of downtime, followed by a period of high load. The application, hosted primarily on AWS in the US-East region, was affected by an AWS console failure, which prompted an emergency switch (a "flip") to a backup provider. The flip was completed by 9:14 pm, restoring service under high load; the load problem was resolved by 10:03 pm. Although the flip succeeded, the team identified several areas for improvement: faster monitoring notifications, better organization of the group call, and a more efficient flip process. To prevent a recurrence, PagerDuty plans to migrate its data center to AWS US-West, adopt a three-provider setup for greater fault tolerance, and improve its internal monitoring and communication tools. The company also aims to streamline and automate the emergency flip process and to run load tests so the system performs better under high load.
