Company
Date Published
Author
Pete Hamilton
Word count
2444
Language
English
Hacker News points
None

Summary

On October 20, 2025, a major AWS outage in the us-east-1 region disrupted several key services of a platform that is hosted in Google Cloud but relies on AWS-hosted third-party dependencies. On-call notifications, SAML authentication, and the Scribe AI incident note taker were all affected because each depends on services hosted in AWS. Although the platform is designed to tolerate integration failures and high load, unexpected dependencies combined with high traffic caused complications. The company responded by attempting to reroute services, scale Kubernetes deployments, and modify its notification system, but its deployment pipeline was itself hampered by a dependency on Docker Hub. The company has since removed certain dependencies and optimized its infrastructure to prevent similar disruptions, and it continues to work on making its systems more resilient to such outages. The incident underscored the risks of relying on third-party providers and the importance of robust contingency planning.