Incident Report: Exercises, Cleanups, and Evacuations

Blog post from Honeycomb

Post Details
Company: Honeycomb
Author: Fred Hebert
Word Count: 4,699
Company Posts That Month: 6
Language: English
Hacker News Points: -
Summary

Honeycomb's disaster recovery exercise on December 5th, which intentionally destroyed Kafka brokers, significantly disrupted the production environment. Earlier pre-production tests had run smoothly, but in production multiple Kafka topic partitions went leaderless, which damaged data and created operational challenges. The response involved identifying impacted teams, salvaging partitions, and managing disk space. Over the following weeks, Honeycomb worked through a complex recovery: temporarily shutting down tiered storage to free disk space, manually repairing Kafka topics, and ultimately planning a complete migration to a new Kafka cluster. The incident revealed problems with documentation, communication, and sociotechnical alignment within the organization, and prompted reflection on how to improve incident management and knowledge sharing. The effort also showed the value of planning ahead and staying adaptable when systems fail in unforeseen ways.

Trends Found in this Post
Trend                 Post Mentions  Total Month Mentions  Posts  Companies  MoM
Observability         1              2,816                 550    145        +34%
Platform Engineering  1              368                   138    58         +24%
Real-time             1              5,046                 1,089  214        +11%