Outage Post Mortem – March 15

Company

PagerDuty

Date Published

March 16, 2012

Author

John Laban

Word count

1023

Language

English

Hacker News points

None

URL

www.pagerduty.com/blog/uncategorized/outage-post-mortem-march-15

Summary

PagerDuty experienced a 15-minute outage caused by internet connectivity issues across AWS's US-East-1 region, highlighting the need for better system reliability and communication protocols. Although a fallback system was hosted in a separate data center, delayed alerts from monitoring systems and internal miscommunication about how to use emergency broadcast systems extended the downtime. PagerDuty has been re-engineering its systems for full fault tolerance and aims to implement a new architecture that eliminates single points of failure by using a clustered multi-node datastore spread across independent data centers. Immediate improvements include better redundancy for email and API endpoints, possibly moving critical systems off AWS US-East, and improved monitoring and communication procedures. Further details are to be shared in subsequent updates.
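
The endpoint redundancy mentioned in the summary can be illustrated with a minimal failover sketch. This is not PagerDuty's implementation. The function `send_with_failover` and the stand-in endpoints `primary` and `secondary` are hypothetical names introduced here for illustration. The sketch assumes each endpoint is a callable that either accepts an event or raises `ConnectionError` when its region is unreachable.

```python
from typing import Callable, Sequence


def send_with_failover(event: dict, endpoints: Sequence[Callable[[dict], bool]]) -> int:
    """Try each endpoint in order; return the index of the first one that accepts the event."""
    for index, deliver in enumerate(endpoints):
        try:
            if deliver(event):
                return index
        except ConnectionError:
            # Treat an unreachable endpoint like a rejection and try the next one.
            continue
    raise RuntimeError("all endpoints failed")


def primary(event: dict) -> bool:
    # Hypothetical stand-in for an endpoint in a region that has lost connectivity.
    raise ConnectionError("region unreachable")


def secondary(event: dict) -> bool:
    # Hypothetical stand-in for an endpoint hosted in an independent data center.
    return True
```

Calling `send_with_failover({"incident": "example"}, [primary, secondary])` falls through to the second endpoint. If every endpoint is unreachable, the function raises rather than silently dropping the event, so the caller can escalate through another channel.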