Company
Date Published
Author
David Hayes
Word count
415
Language
English
Hacker News points
None

Summary

Over a single weekend, PagerDuty saw two significant disruptions: a derecho storm affecting 7% of AWS, and the addition of a leap second to UTC. Both caused widespread server issues, though the AWS outage appeared more severe at first glance. Initial graphs showed a 20x increase in traffic related to the leap second. Further investigation revealed that the AWS spike hit faster and peaked at 30 times average traffic, compared with 18 times for the leap second. Averaged over two hours, however, the AWS outage ran at 7 times average traffic while the leap second ran at 9 times, which highlights how hard it is to interpret internet incidents and system load. Despite these challenges, PagerDuty's approach of monitoring at the account level, through de-duplication and escalation, helped manage the situation, and a closer look at the different alert types provided further insight into the response dynamics.
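
The gap between the peak and two-hour figures is easier to see with a small calculation. The sketch below is illustrative only: the function name `load_multiples`, the baseline, and both traffic series are hypothetical values invented for this example, not PagerDuty's data. It shows how one incident can have the higher peak multiple while another has the higher average multiple over the same window.

```python
from statistics import mean


def load_multiples(window, baseline):
    """Return (peak multiple, average multiple) of a traffic window
    relative to a baseline rate."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return max(window) / baseline, mean(window) / baseline


# Hypothetical per-interval event counts, invented for illustration only.
baseline = 10.0
sharp_spike = [300, 60, 20, 10, 10, 10]    # high peak, fast decay
broad_spike = [150, 120, 90, 60, 40, 20]   # lower peak, slower decay

sharp_peak, sharp_avg = load_multiples(sharp_spike, baseline)
broad_peak, broad_avg = load_multiples(broad_spike, baseline)

print(f"sharp spike: peak {sharp_peak:.1f}x, average {sharp_avg:.1f}x")
print(f"broad spike: peak {broad_peak:.1f}x, average {broad_avg:.1f}x")
```

With these made-up numbers the sharp spike wins on peak and loses on average, which is the same pattern the summary describes for the AWS outage and the leap second.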