ChaosCat: Automating Fault Injection at PagerDuty
Blog post from PagerDuty
Chaos Engineering is the practice of experimenting on distributed systems to verify that they can withstand unpredictable conditions; companies such as Netflix, Dropbox, and Twilio use such techniques. At PagerDuty, the practice has evolved from manual fault injection to an automated fault injection system called ChaosCat, which is inspired by Chaos Monkey but adapts more readily to different service types. Failures were initially injected by hand, which gave precise control over each experiment and a clear understanding of its effects. As the infrastructure grew, automation was introduced so that individual teams could run their own fault injections. ChaosCat is a Scala-based Slack bot that launches randomized chaos attacks during business hours, and only when the system is fully operational, so that teams are on hand to address issues as they arise. The approach has exposed gaps in run books and on-call rotations, which has led to more automation and a higher priority for technical debt, and in turn to greater confidence in service reliability. ChaosCat is not currently open source because it is deeply integrated with PagerDuty's internal systems, but the company welcomes feedback and questions and hopes more organizations will adopt Chaos Engineering to test and improve the resilience of their infrastructure.
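
The scheduling rule described above (random attacks, only during business hours, only when the system is healthy) can be illustrated with a minimal sketch. It is written in Python for brevity rather than Scala, and it is not ChaosCat's actual code: the attack names, the 09:00 to 17:00 weekday window, and the `system_healthy` flag are all assumptions made for illustration.

```python
import random
from datetime import datetime, time

# Hypothetical attack names; the post does not list ChaosCat's real attacks.
ATTACKS = ["kill_process", "add_network_latency", "fill_disk"]

# Assumed business-hours window; the post does not state the actual hours.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(17, 0)


def in_business_hours(now: datetime) -> bool:
    """True on Monday through Friday inside the assumed time window."""
    return now.weekday() < 5 and BUSINESS_START <= now.time() < BUSINESS_END


def pick_attack(now: datetime, system_healthy: bool, rng: random.Random):
    """Return a randomly chosen attack, or None when it is not safe to run one."""
    if not in_business_hours(now) or not system_healthy:
        return None
    return rng.choice(ATTACKS)
```

Passing the clock reading, the health flag, and the random generator as arguments keeps the gate easy to test. In a real bot the health flag would presumably come from monitoring, so that no new failure is injected while an incident is already in progress.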