Company
Date Published
Author
Tim Nolet, Batuhan Icoz
Word count
479
Language
English
Hacker News points
None

Summary

We experienced a 5-hour outage between 15:44 PM UTC and 20:38 PM UTC due to a bug in our release and deployment software, which affected browser check results but not API checks. A recent change aimed to prevent accidental "canary flags" from being deployed without proper testing, but another code change was overlooked. The issue occurred when the deployment pipeline was updated without the necessary accompanying code change, triggering an anomaly that wasn't caught by our existing monitoring and alerting setup. We were alerted to the problem by a customer ticket on Intercom and resolved the issue with a hotfix deployed within 45 minutes. To prevent similar issues in the future, we've added more checks to the deployment process, implemented specific CloudWatch alarms for anomaly detection, and decided not to deploy after lunch hours for at least the foreseeable future.