Incident Review: Working as Designed, But Still Failing
Blog post from Honeycomb
In this detailed incident review, Honeycomb describes degraded query performance and alerting caused by complex interactions between hot and cold data storage and an unexpected load on AWS Lambda capacity. Inaccurate timestamps in a customer's telemetry data initially led trigger queries to access cold storage unnecessarily, which tied their performance to Lambda usage. The assumption that future timestamps in triggers were the cause misled the investigation until a fresh perspective identified the true culprit: repeated backfilling of a single Service Level Objective (SLO). The incident shows how hard it is to manage complex systems in which valid but unexpected use cases can exhaust resources without any technical bug being present. The resolution involved correcting the SLO, implementing stricter controls on data handling, and improving communication and support for incident management. The experience underscored the value of diverse perspectives in troubleshooting and the need for adaptable controls in system design to handle unforeseen usage patterns.
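
As an illustration of what an adaptable control on backfills might look like, here is a minimal sketch of a per-SLO cap within a sliding time window. It is not taken from the post: the class name `BackfillBudget`, the `allow` method, and the limit values are all hypothetical, and the post does not say which controls were actually put in place.

```python
from collections import defaultdict, deque


class BackfillBudget:
    """Hypothetical per-SLO cap on backfills within a sliding time window."""

    def __init__(self, max_backfills: int, window_seconds: float):
        self.max_backfills = max_backfills
        self.window_seconds = window_seconds
        # Maps an SLO identifier to the times of its accepted backfills.
        self._history = defaultdict(deque)

    def allow(self, slo_id: str, now: float) -> bool:
        """Return True and record the backfill if the SLO is under its cap."""
        recent = self._history[slo_id]
        # Drop accepted backfills that have aged out of the window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_backfills:
            return False  # Over budget: reject without recording.
        recent.append(now)
        return True


# Assumed limits for illustration: 2 backfills per SLO per 60 seconds.
budget = BackfillBudget(max_backfills=2, window_seconds=60)
results = [budget.allow("slo-a", t) for t in (0, 10, 20)]
```

Because the limits are plain constructor arguments, they can be adjusted per customer or per resource as new usage patterns appear. A cap of this kind bounds how much shared capacity a single SLO can consume, even when each individual request is valid.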