Incident Review: Working as Designed, But Still Failing

Blog post from Honeycomb

Post Details
Company: Honeycomb
Author: Fred Hebert
Word Count: 1,326
Language: English
Summary

This incident review from Honeycomb describes significant problems with query performance and alerting, caused by complex interactions between hot and cold data storage and an unexpected load on AWS Lambda capacity. Initially, inaccurate timestamps in a customer's telemetry data led trigger queries to access cold storage unnecessarily, which tied their performance to Lambda usage. The assumption that future timestamps in triggers were the cause misled the investigation until a fresh perspective identified the true culprit: repeated backfilling of a single Service Level Objective (SLO). The incident shows how hard it is to manage complex systems in which valid but unexpected use cases can exhaust resources without any technical bug being present. The resolution involved correcting the SLO, implementing stricter controls on data handling, and improving communication and support for incident management. The experience underscores the value of diverse perspectives in troubleshooting and the need for adaptable controls in system design to manage unforeseen usage patterns.