Deploy on Friday? How About Destroy on Friday! A Chaos Engineering Experiment – Part 1

Blog post from Honeycomb

Post Details
Company: Honeycomb
Date Published:
Author: Lex Neva
Word Count: 2,080
Language: English
Summary

Honeycomb ran a chaos engineering experiment that intentionally disrupted one-third of its production infrastructure using AWS Fault Injection Service (FIS) to test system resilience. The experiment, run in both non-production and production environments, aimed to uncover unexpected failure modes and improve reliability without impacting customers. Initial tests revealed a ZooKeeper lock acquisition bug, false telemetry alerts, and AWS PrivateLink traffic disruptions, which were addressed through bug fixes and mitigations. By simulating failures in key components such as the Shepherd service, and by observing a coincidental test of Kafka, Honeycomb gained confidence that its infrastructure could handle an Availability Zone (AZ) failure. The company chose to test during peak traffic hours, both to ensure readiness and to reduce the risks of responding to an incident during off-peak hours, reflecting a commitment to learning from real-world scenarios to improve reliability.