
Robert Ross & Tammy Butow: Incident Repro & Playbook Validation with Chaos Engineering - Chaos Conf

Blog post from Gremlin

Post Details
Company
Date Published
Author
Gremlin
Word Count: 3,232
Language: English
Hacker News Points: -
Summary

At Chaos Conf 2019, Robert Ross, CEO of FireHydrant, and Tammy Butow, Principal SRE at Gremlin, discussed incident reproduction and playbook validation with chaos engineering, using the 2017 AWS S3 outage as a case study. They argued that organizations should simulate past incidents to prevent recurrence, and that chaos engineering applies not only to software but also to processes and team onboarding. They showed how chaos engineering tools such as Gremlin can test system resilience by reproducing past outages and can validate the reliability of playbooks. They closed by stressing the value of validated processes, akin to "team traditions," for effective incident response and system reliability, and encouraged continuous improvement through iterative testing and learning.