Company
Date Published
Author
John Laban
Word count
1023
Language
English
Hacker News points
None

Summary

PagerDuty experienced a 15-minute outage caused by internet connectivity issues across AWS's US-East-1 region. Although a fallback system was hosted in a separate data center, delayed alerts from the monitoring systems and internal miscommunication about using the emergency broadcast systems extended the downtime. PagerDuty has been re-engineering its systems for full fault tolerance and aims to implement a new architecture that eliminates single points of failure by using a clustered multi-node datastore spread across independent data centers. Immediate improvements include better redundancy for email and API endpoints, possibly moving critical systems off AWS US-East, and improving monitoring systems and communication procedures. Further details are to be shared in subsequent updates.