Post-incident review for 20th October 2025

Post Details

Company

Buildkite

Date Published

Oct. 25, 2025

Author

The Buildkite Team

Word Count

715

Company Posts That Month

4

Language

English

Hacker News Points

-

Source URL

buildkite.com/resources/blog/post-incident-review-for-20th-october-2025

Summary

On October 20, 2025, a major AWS outage in the us-east-1 region caused significant service disruption for Buildkite customers, with high latency and increased error rates. The primary cause was that Buildkite could not provision additional server capacity during peak usage hours. Customers initially saw minimal impact, but conditions worsened as traffic increased. Because of Buildkite's sharded architecture, the impact on job dispatch and notifications varied between customers, and the worst-affected customers faced delays of over an hour. The REST and GraphQL APIs also suffered increased latency and error rates, which affected users who depend on them for job execution. In response, Buildkite paused all deployments, posted updates on Statuspage, and attempted to mitigate the impact by redistributing load across shards. Although the outage also made communication difficult, the on-call team stabilized performance by the evening as AWS began allowing limited compute provisioning, and normal service levels resumed by late evening. The incident highlighted the need for stronger resilience strategies and alternative communication channels for handling region-wide disruptions.

Trends Found in this Post

Trend	Post Mentions	Total Month Mentions	Posts	Companies	MoM
Kubernetes	1	1,423	250	85	+59%