Company
Date Published
Author
Tom Hacohen
Word count
1424
Language
English
Hacker News points
None

Summary

The outage occurred on June 6th, 2023, due to a change made to the database authentication token fetching code, which caused a full table scan instead of using indices in the US region, leading to elevated errors and CPU usage. The issue was only present in the US region, not affecting other regions. An investigation revealed that the problem required a large number of parallel API calls with different authentication tokens to trigger, while continuous canary tests only emulated a small number of customers. The team is still investigating why the indices weren't used and has implemented measures to prevent similar issues, including a company-wide feature freeze, improved deployment speed, and automatic multi-region fallbacks. The incident highlights the importance of thorough testing and monitoring in ensuring the reliability and performance of API services.