How to Manage a Tier Zero Service
Blog post from PagerDuty
Tim Armandpour, PagerDuty's SVP of Product Development, argues that robust incident response processes are essential to system reliability in a world that demands constant availability. He describes "Failure Fridays," in which PagerDuty's engineering team deliberately injects failures into the live production environment to improve system resilience and rehearse incident response. The practice aims to understand failure scenarios, build collaboration across teams, and prepare engineers to handle real incidents without panic. Key lessons include testing a variety of failure scenarios to expose vulnerabilities, keeping post-mortems blameless so that they yield actionable improvements, and treating each vulnerability found as a chance to harden the infrastructure. Reliability is central to PagerDuty's customer promise, so preparing for failure is a core part of its operational strategy.
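To make the idea of deliberate failure injection concrete, here is a minimal in-process sketch in Python. It is not PagerDuty's tooling, and the post summarized above does not describe its implementation. The names `FaultInjector`, `InjectedFailure`, `fetch_status`, and `run_drill` are all hypothetical, and a real drill would target actual infrastructure rather than a stand-in function. The sketch only illustrates the general shape: switch a drill on, fail a controlled fraction of calls, record what happened, and always switch the drill off again.

```python
import random


class InjectedFailure(Exception):
    """Raised in place of a real dependency error during a drill."""


class FaultInjector:
    """Hypothetical helper: fails a fraction of calls while a drill is active."""

    def __init__(self, failure_rate, seed=None):
        if not 0.0 <= failure_rate <= 1.0:
            raise ValueError("failure_rate must be between 0 and 1")
        self.failure_rate = failure_rate
        self.active = False
        self._rng = random.Random(seed)

    def call(self, func, *args, **kwargs):
        # Inject a failure only while the drill is switched on.
        if self.active and self._rng.random() < self.failure_rate:
            raise InjectedFailure(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)


def fetch_status():
    # Stand-in for a real dependency call (assumption for illustration).
    return "ok"


def run_drill(injector, attempts):
    """Run a drill and count how many calls succeeded or failed."""
    results = {"ok": 0, "failed": 0}
    injector.active = True
    try:
        for _ in range(attempts):
            try:
                injector.call(fetch_status)
                results["ok"] += 1
            except InjectedFailure:
                results["failed"] += 1
    finally:
        # Always end the drill, even if something unexpected goes wrong.
        injector.active = False
    return results
```

The counts returned by `run_drill` are the kind of raw observation a blameless post-mortem could start from: which calls failed, whether the callers coped, and what should be hardened next.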