Incident Review: Working as Designed, But Still Failing

Post Details

Company

Honeycomb

Date Published

Sept. 9, 2022

Author

Fred Hebert

Word Count

1,326

Company Posts That Month

11

Language

English

Hacker News Points

-

Source URL

www.honeycomb.io/blog/incident-review-designed-failing

Summary

This incident review describes how Honeycomb ran into query performance and alerting problems caused by complex interactions between hot and cold data storage and an unexpected load on AWS Lambda capacity. Inaccurate timestamps in a customer's telemetry data caused trigger queries to access cold storage unnecessarily, which tied their performance to Lambda usage. The assumption that future-dated timestamps in triggers caused the issue misled the investigation until a fresh perspective identified the true culprit: repeated backfilling of a single Service Level Objective (SLO). The incident shows how hard it is to manage complex systems in which valid but unexpected use cases can exhaust resources without any technical bug being involved. The resolution involved correcting the SLO, implementing stricter controls on data handling, and improving communication and support for incident management. The experience underscored the value of diverse perspectives in troubleshooting and the need for adaptable controls in system design to handle unforeseen usage patterns.
