Incident Report: Running Dry on Memory Without Noticing

Post Details

Company

Honeycomb

Date Published

Nov. 21, 2019

Author

Liz Fong-Jones

Word Count

1,080

Company Posts That Month

4

Language

English

Hacker News Points

-

Post removed?

No

Source URL

www.honeycomb.io/blog/incident-report-running-dry-on-memory-without-noticing

Summary

On November 6, 2019, a slow memory leak in Honeycomb's ingest backend caused widespread backend crashes and request failures, and the service intermittently rejected 1-3% of customer telemetry data over several short periods. Although Service Level Objective (SLO) measurements detected the issue quickly, confirmation bias and unclear incident handling delayed resolution: engineers initially attributed the problem to AWS's Application Load Balancers. The team later traced the failures to the memory leak and the resulting process restarts, then reverted the problematic commit to stabilize the service. The incident highlighted the need for better incident declaration, more rigorous questioning of assumptions, and clearer communication protocols, and underscored the value of transparency and iterative process refinement in preventing recurrences.

