Incident Report: Investigating an Incident That’s Already Resolved
Blog post from Honeycomb
On April 16, four new API servers responded with 401 and 500 errors for about 1.5 hours, rejecting approximately 10% of event traffic to the API hosts. The incident went unnoticed until a week later, when a customer reported missing data. The investigation traced the cause to an automated deployment that proceeded even though a database migration had not completed. The new hosts requested a database column that did not exist, which produced 500 errors, followed by 401 errors caused by null API keys. Thorough observability and sufficient data retention allowed the team to diagnose the problem retrospectively and to make changes that prevent similar occurrences. The incident also showed that SLO monitoring needs to be more comprehensive and should include 401 errors, so that it better captures disruptions to the user experience.
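To illustrate why leaving 401 responses out of an SLO can hide this kind of failure, here is a minimal Python sketch. It is not Honeycomb's actual SLO definition: the `Request` type, the `success_ratio` function, the host names, and the sample traffic mix are all hypothetical, chosen only so that roughly 10% of requests fail, as in the incident.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Request:
    host: str    # hypothetical host label, e.g. "api-old" or "api-new"
    status: int  # HTTP status code returned to the client


def is_bad(status: int, count_401: bool) -> bool:
    """A request counts against the SLO if it is a 5xx, or a 401 when enabled."""
    return status >= 500 or (count_401 and status == 401)


def success_ratio(requests: Iterable[Request], count_401: bool) -> float:
    """Fraction of requests that did not count against the SLO."""
    reqs = list(requests)
    if not reqs:
        return 1.0  # no traffic means nothing has failed
    good = sum(1 for r in reqs if not is_bad(r.status, count_401))
    return good / len(reqs)


# Hypothetical traffic: 90 successes on old hosts, and on the new hosts
# a few 500s followed by many more 401s.
sample = (
    [Request("api-old", 200)] * 90
    + [Request("api-new", 500)] * 2
    + [Request("api-new", 401)] * 8
)

only_5xx = success_ratio(sample, count_401=False)
with_401 = success_ratio(sample, count_401=True)
print(f"SLI counting only 5xx:    {only_5xx:.2%}")
print(f"SLI counting 5xx and 401: {with_401:.2%}")
```

With this made-up mix, an SLI that counts only 5xx responses reports 98% success, while one that also counts 401s reports 90%. In practice, some 401s are legitimate (a client really did send a bad key), so a production SLO would likely need a way to distinguish those from server-induced rejections rather than counting every 401 as a failure.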