Incident Report: Exercises, Cleanups, and Evacuations

Post Details

Company

Honeycomb

Date Published

Feb. 25, 2026

Author

Fred Hebert

Word Count

4,699

Company Posts That Month

6

Language

English

Hacker News Points

-

Source URL

www.honeycomb.io/blog/incident-report-exercises-cleanups-and-evacuations

Summary

Honeycomb's disaster recovery exercise on December 5th, which intentionally destroyed Kafka brokers, caused significant disruption in the company's production environment. Although earlier pre-production tests had run smoothly, the production exercise left multiple Kafka topic partitions leaderless, resulting in damaged data and operational challenges. The response involved identifying impacted teams, salvaging partitions, and managing disk space concerns. Over the following weeks, Honeycomb worked through a complex recovery: temporarily shutting down tiered storage to free disk space, manually repairing Kafka topics, and ultimately planning a complete migration to a new Kafka cluster. The incident revealed issues with documentation, communication, and sociotechnical alignment within the organization, prompting reflections on improving incident management and knowledge sharing. The recovery effort highlighted the importance of planning and adaptability when responding to unforeseen technical failures.

Trends Found in this Post

Trend	Post Mentions	Total Month Mentions	Posts	Companies	MoM
Observability	1	2,816	550	145	+34%
Platform Engineering	1	368	138	58	+24%
Real-time	1	5,046	1,089	214	+11%