Incident Report: Exercises, Cleanups, and Evacuations
Blog post from Honeycomb
Honeycomb's disaster recovery exercise on December 5th, which involved intentionally destroying Kafka brokers, caused significant disruption in its production environment. Although earlier pre-production tests had run smoothly, the production exercise left multiple Kafka topic partitions leaderless, resulting in damaged data and operational challenges. The company responded by identifying impacted teams, salvaging partitions, and managing disk space concerns. Over the following weeks, Honeycomb worked through a complex recovery process that included temporarily shutting down tiered storage to free disk space, manually repairing Kafka topics, and ultimately planning a complete migration to a new Kafka cluster. The incident revealed issues with documentation, communication, and sociotechnical alignment within the organization, prompting reflections on improving incident management and knowledge sharing. The effort highlighted the importance of planning and adaptability when responding to unforeseen technical challenges.