Company
Date Published
Author
Gremlin
Word count
3232
Language
English
Hacker News points
None

Summary

At Chaos Conf 2019, Robert Ross, CEO of FireHydrant, and Tammy Butow, Principal SRE at Gremlin, discussed incident reproduction and playbook validation with chaos engineering, using the 2017 AWS S3 outage as a case study. They argued that organizations should simulate past incidents to prevent recurrence, and that chaos engineering applies not only to software but also to processes and team onboarding. The speakers showed how chaos engineering tools such as Gremlin can test system resilience by reproducing past outages and validating that playbooks are reliable. They concluded that validated processes, which they compared to "team traditions," are essential to effective incident response and system reliability, and they encouraged continuous improvement through iterative testing and learning.