Destroy on Friday: The Big Day 🧨 A Chaos Engineering Experiment – Part 2
Blog post from Honeycomb
In the second part of this series on a chaos engineering experiment, the team tested their infrastructure's resilience by simulating the failure of an entire availability zone (AZ) with AWS's Fault Injection Service (FIS). The experiment abruptly terminated EC2 instances, blocked network communication, and prevented new instances from being created in the affected AZ, so the team could observe how their systems responded. It exposed one specific issue: AWS PrivateLink routed some requests to the affected AZ, which resulted in HTTP 504 errors. The team mitigated this ahead of time by instructing the Application Load Balancer (ALB) to stop using the compromised AZ. Although early telemetry raised concerns, the actual impact was minimal: HTTP 500 errors affected about one percent of requests for 31 seconds, and the remaining infrastructure continued to handle the load. Kubernetes quickly launched new instances to replace the terminated ones, and within hours the infrastructure was back to normal. The test confirmed the system's resilience, surfaced ideas for future improvements, and the team encourages others to run similar experiments to verify the reliability of their own services.
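To put the reported impact in perspective, the figures above (about one percent of requests failing for 31 seconds) can be turned into a rough availability estimate. The following is a minimal sketch, not something taken from the post: the function name `impact_window`, the request rate of 10,000 requests per second, and the 30-day measurement window are all hypothetical values chosen for illustration, and the calculation assumes traffic is constant over the window.

```python
def impact_window(error_fraction, duration_s, requests_per_s, window_s):
    """Estimate the cost of a single error burst.

    Returns (failed_requests, availability), where availability is the
    fraction of requests in the whole window that succeeded, assuming a
    constant request rate.
    """
    total_requests = requests_per_s * window_s
    failed_requests = error_fraction * requests_per_s * duration_s
    availability = 1 - failed_requests / total_requests
    return failed_requests, availability


# From the post: roughly 1% of requests failed for 31 seconds.
ERROR_FRACTION = 0.01
BURST_SECONDS = 31

# Hypothetical assumptions, not from the post:
REQUESTS_PER_SECOND = 10_000       # assumed steady traffic
WINDOW_SECONDS = 30 * 24 * 3600    # assumed 30-day measurement window

failed, availability = impact_window(
    ERROR_FRACTION, BURST_SECONDS, REQUESTS_PER_SECOND, WINDOW_SECONDS
)

# A 1% error rate for 31 s is equivalent to a full outage of 0.31 s.
equivalent_outage_s = ERROR_FRACTION * BURST_SECONDS

print(f"failed requests: {failed:.0f}")
print(f"availability over window: {availability:.7%}")
print(f"equivalent full-outage seconds: {equivalent_outage_s:.2f}")
```

Under a constant-rate assumption the availability figure does not depend on the request rate at all; only the absolute count of failed requests does. Real traffic is rarely constant, so treat the result as an order-of-magnitude estimate.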