On October 20, 2025, a major AWS outage in the us-east-1 region caused significant service disruption for Buildkite customers, in the form of high latency and increased error rates. The primary cause was the inability to provision additional server capacity during peak usage hours. Customers initially saw minimal impact, but conditions worsened as traffic increased. Because of Buildkite's sharded architecture, the impact on job dispatch and notifications varied from customer to customer, and the worst-affected customers faced delays of over an hour. The REST and GraphQL APIs also showed increased latency and error rates, which affected users who depend on them for job execution. In response, Buildkite paused all deployments, posted updates through Statuspage, and attempted to mitigate the impact by redistributing load across shards. Despite communication challenges caused by the outage, the on-call team stabilized performance by the evening as AWS began allowing limited compute provisioning, and normal service levels resumed by late evening. The incident highlighted the need for stronger resilience strategies and alternative communication channels for handling region-wide disruptions.