Failure Fridays: Four Years On
Blog post from PagerDuty
PagerDuty's "Failure Fridays" initiative, started in 2013, is a weekly practice of injecting controlled faults into the company's production environment to test and improve system resilience, with the aim of doing so without affecting customers. Over the years the practice has evolved, incorporating elements of Chaos Engineering and expanding from single-service tests to infrastructure-wide simulations, including Availability Zone (AZ) and Region failures. It has enabled PagerDuty to identify and fix potential issues before they reach users, and it has also fostered internal trust and operational improvements. The initiative has driven significant process automation and documentation, including fault-injection tools such as "Reboot Roulette" and "Chaos Cat". By June 2017, PagerDuty had run 121 sessions, injected 644 faults, and created over 200 tickets to address the issues found, illustrating how such stress tests can improve both software delivery and team cohesion.
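The summary names the fault-injection tools but does not describe how they work. As a rough illustration of the general idea of randomized fault injection, here is a minimal dry-run sketch in Python. Everything in it is an assumption made for illustration: `PlannedFault`, `plan_random_reboot`, `describe`, and the host names are hypothetical and are not PagerDuty's implementation. The sketch only picks a target and reports what it would do; it never touches a real machine.

```python
import random
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class PlannedFault:
    """A fault that has been chosen but not executed (hypothetical type)."""
    host: str
    action: str


def plan_random_reboot(
    hosts: Sequence[str], rng: Optional[random.Random] = None
) -> PlannedFault:
    """Pick one host at random as the target of a planned reboot.

    Passing a seeded `rng` makes the choice reproducible, which is handy
    when rehearsing a session or writing tests.
    """
    if not hosts:
        raise ValueError("no candidate hosts to choose from")
    rng = rng or random.Random()
    return PlannedFault(host=rng.choice(list(hosts)), action="reboot")


def describe(fault: PlannedFault) -> str:
    """Render the planned fault as a dry-run message; nothing is executed."""
    return f"[dry run] would {fault.action} {fault.host}"


if __name__ == "__main__":
    # Hypothetical host names, used only for this example.
    candidates = [
        "app-1.example.internal",
        "app-2.example.internal",
        "app-3.example.internal",
    ]
    print(describe(plan_random_reboot(candidates)))
```

Keeping the selection step separate from execution means a team could review or announce the planned fault before anything is actually disrupted, in keeping with the controlled nature of the sessions described above.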