Incident Resolution: Do You Remember, the Twenty Fires of September?

Post Details

Company: Honeycomb
Date Published: Nov. 10, 2021
Author: Fred Hebert
Word Count: 1,578
Company Posts That Month: 5
Language: English
Hacker News Points: -
Post Removed: No
Source URL: www.honeycomb.io/blog/incident-resolution-september-retrospective

Summary

Throughout September and early October, Honeycomb experienced more than 20 internal issues and five public incidents after data ingestion grew by 40% over a few weeks. The surge strained the system at several points, including Kafka cluster imbalances, Linux file system bugs, and unexpected deletions in Terraform. The migration of services to EKS containers and under-provisioned dogfood environments compounded the problems. Beagle, an SLO data processor, suffered persistent delays caused by network throttling and partition imbalances, which required aggressive scaling and changes to Kafka configurations. Scaling and manual intervention mitigated some of these issues, but balancing the demands of individual components remained difficult because different services needed different scaling patterns. The experience underscored the importance of understanding system limits and of maintaining a sustainable operational pace in order to adapt to production challenges, and it showed that each incident is also a learning opportunity. Honeycomb invites community feedback on scaling experiences and operational pacing practices through Twitter and its community Slack group, Pollinators.

