Incident Report: Exercises, Cleanups, and Evacuations

Blog post from Honeycomb

Post Details
Company: Honeycomb
Author: Fred Hebert
Word Count: 4,699
Language: English
Summary

On December 5th, Honeycomb ran a disaster recovery exercise that intentionally destroyed Kafka brokers, and it significantly disrupted the production environment. Earlier pre-production tests had run smoothly, but in production multiple Kafka topic partitions went leaderless, which damaged data and created operational challenges. The response involved identifying impacted teams, salvaging partitions, and managing disk space concerns. Over the following weeks, Honeycomb worked through a complex recovery: temporarily shutting down tiered storage to free disk space, manually repairing Kafka topics, and ultimately planning a complete migration to a new Kafka cluster. The incident exposed issues with documentation, communication, and sociotechnical alignment within the organization, and prompted reflection on how to improve incident management and knowledge sharing. The effort underscored the value of planning and adaptability when responding to unforeseen technical failures.