Company
Date Published
Author
Dan Bennett
Word count
830
Language
English
Hacker News points
None

Summary

Sentry experienced a large-scale incident on May 6th, 2022, in which approximately 90% of incoming events were lost for more than six hours, with significant impacts on customer data and build processes. The root cause was an issue in Google Cloud Platform, in Sentry's primary compute region, that affected persistent volumes attached to Sentry's infrastructure. The SRE team worked to identify and mitigate the issue, and resolved it at 12:17 PM PDT. The incident exposed weaknesses in Sentry's resilience to single-zone failures and in its distributed ingestion infrastructure, and Sentry has prioritized work in both areas to reduce the likelihood of future incidents and to improve recovery times.