How Gremlin runs a GameDay
Blog post from Gremlin
Gremlin's GameDay is a structured practice for improving system reliability by deliberately creating failures in a controlled environment and analyzing the results. Gremlin runs a GameDay at least once a month and invites the entire company so that a range of perspectives and insights is represented. The process has three phases: preparing (the PreGame), executing (the GameDay itself), and learning (the findings). During the PreGame, the team fills out a detailed template and plans the scenarios. During execution, participants collaborate virtually, run each scenario, and monitor how the systems behave. After each scenario, the team discusses what happened and documents it in Jira, which helps resolve issues and refine the process. This methodical approach lets Gremlin continuously test its assumptions, validate past fixes, and confirm that its systems are resilient, with the ultimate aim of a more reliable internet.
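The three phases can be pictured as a small record that moves from preparation to execution to findings. The sketch below is only an illustration of that flow, not Gremlin's actual template or tooling: the class names, field names, and the example scenario are all hypothetical, and the comparison of expected against observed behavior is a simplifying assumption about how a surprise might be flagged.

```python
from dataclasses import dataclass, field
from enum import Enum


class Phase(Enum):
    # The three phases named in the post; the enum itself is hypothetical.
    PREGAME = "prepare"
    GAMEDAY = "execute"
    FINDINGS = "learn"


@dataclass
class Scenario:
    # Hypothetical fields, not the fields of Gremlin's PreGame template.
    name: str
    expected_behavior: str                 # assumption written down during the PreGame
    observed_behavior: str = ""            # filled in while the scenario runs
    follow_ups: list[str] = field(default_factory=list)  # e.g. ticket summaries

    def matches_expectation(self) -> bool:
        return self.observed_behavior == self.expected_behavior


@dataclass
class GameDay:
    title: str
    scenarios: list[Scenario] = field(default_factory=list)
    phase: Phase = Phase.PREGAME

    def advance(self) -> Phase:
        """Move to the next phase; stay put once FINDINGS is reached."""
        order = list(Phase)
        idx = order.index(self.phase)
        if idx < len(order) - 1:
            self.phase = order[idx + 1]
        return self.phase

    def findings(self) -> list[Scenario]:
        """Scenarios that were run and did not behave as assumed."""
        return [
            s for s in self.scenarios
            if s.observed_behavior and not s.matches_expectation()
        ]


# Usage: plan one scenario, run it, then collect what surprised the team.
day = GameDay("Example GameDay")
day.scenarios.append(
    Scenario("database latency", expected_behavior="requests retry and succeed")
)
day.advance()  # PreGame -> GameDay
day.scenarios[0].observed_behavior = "requests time out"
day.scenarios[0].follow_ups.append("Investigate retry settings")
day.advance()  # GameDay -> findings
surprises = day.findings()
```

Recording the expected behavior before the scenario runs is what makes the later comparison meaningful: a mismatch becomes a finding to discuss and track, and a match validates an assumption or a past fix.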