Deploy on Friday? How About Destroy on Friday! A Chaos Engineering Experiment

Post Details

Company

Honeycomb

Date Published

July 16, 2024

Author

Lex Neva

Word Count

2,080

Company Posts That Month

8

Language

English

Hacker News Points

-

Post removed?

No

Source URL

www.honeycomb.io/blog/destroy-on-friday-chaos-engineering-pt1

Summary

Honeycomb ran a chaos engineering experiment in which it intentionally disrupted one-third of its production infrastructure using AWS Fault Injection Service to test system resilience. The experiment covered both non-production and production environments and aimed to uncover unexpected failure modes and improve reliability without degrading service for customers. Initial tests revealed issues such as a ZooKeeper lock acquisition bug, false telemetry alerts, and AWS PrivateLink traffic disruptions, which the team addressed with bug fixes and mitigations. By simulating failures in key components such as the Shepherd service, and through a coincidental test of Kafka, Honeycomb gained confidence that its infrastructure could handle an Availability Zone (AZ) failure. The company chose to test during peak traffic hours to confirm readiness under real load and to reduce the risks of responding to an incident during off-peak hours, reflecting a commitment to learning from real-world scenarios to improve reliability.
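
The capacity reasoning behind an AZ-failure test can be sketched in a few lines of Python. This is a minimal illustration, not Honeycomb's tooling: the fleet layout, the three AZ names, the utilization figures, and both function names are assumptions made up for the example. It shows why losing one of three evenly loaded AZs removes about one-third of capacity, and how to check whether the survivors could carry peak load.

```python
from collections import Counter


def surviving_capacity(instance_azs, failed_az):
    """Return the fraction of instances left after losing every instance in failed_az."""
    if not instance_azs:
        raise ValueError("instance_azs must not be empty")
    counts = Counter(instance_azs)
    lost = counts.get(failed_az, 0)
    return (len(instance_azs) - lost) / len(instance_azs)


def can_absorb_az_loss(instance_azs, failed_az, peak_utilization):
    """Return True if the surviving instances could carry peak load at or below 100% utilization.

    peak_utilization is the fleet-wide utilization at peak (0.0 to 1.0) before the failure.
    """
    remaining = surviving_capacity(instance_azs, failed_az)
    if remaining == 0:
        return False
    # Load is unchanged, so utilization rises in proportion to the lost capacity.
    return peak_utilization / remaining <= 1.0


# Hypothetical fleet: 9 instances spread evenly over three AZs (assumed names).
fleet = ["us-east-1a"] * 3 + ["us-east-1b"] * 3 + ["us-east-1c"] * 3

if __name__ == "__main__":
    print(surviving_capacity(fleet, "us-east-1a"))
    print(can_absorb_az_loss(fleet, "us-east-1a", 0.6))
```

Under these assumed numbers, a fleet running at 60% utilization at peak would end up at roughly 90% after losing one AZ, while a fleet at 70% would be overloaded. A model like this only estimates headroom; an actual fault injection run is what surfaces issues such as lock acquisition bugs or misleading alerts.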

Deploy on Friday? How About Destroy on Friday! A Chaos Engineering Experiment – Part 1