Heap Engineering embarked on a rigorous process of intentionally breaking their data collection system to identify and address potential failure modes, inspired by practices like Jepsen and Chaos Monkey. This involved simulating availability zone (AZ) outages, isolating server components, and using artificial traffic to test system resilience without risking production data. Initial tests revealed issues with Redis dependency, leading to timeouts and Collector overloads, prompting the implementation of circuit-breakers and refactoring of Redis commands to improve responsiveness. Throughout this iterative testing, they uncovered and resolved various issues, such as slow Redis commands and inadequate logging during Collector restarts. The outcome was a more robust and reliable data collection system, reducing lost events and PagerDuty alerts significantly. The process not only strengthened the system's ability to handle AZ outages but also improved overall system performance, demonstrating the value of an agile and exploratory approach to system hardening.