Incident Report: Running Dry on Memory Without Noticing

Blog post from Honeycomb

Post Details
Company: Honeycomb
Author: Liz Fong-Jones
Word Count: 1,080
Language: English
Hacker News Points: -
Summary

On November 6, 2019, a slow memory leak in Honeycomb's ingest backend caused backend processes to crash and restart repeatedly. The resulting request failures meant that 1-3% of customer telemetry data was intermittently rejected over several short periods. Service Level Objective (SLO) measurements detected the problem quickly, but confirmation bias and unclear incident handling delayed resolution: engineers first attributed the failures to AWS's Application Load Balancers. Once the team identified the memory leak and the process restarts as the true cause, it reverted the problematic commit and the service stabilized. The incident showed the need for clearer incident declaration, more questioning of assumptions, and better communication protocols. The report also stresses transparency and iterative process refinement as ways to prevent a recurrence.