10 Years of Failure Friday at PagerDuty: Fostering Resilience, Learning and Reliability
Blog post from PagerDuty
Failure Friday is a decade-old practice at PagerDuty, inspired by chaos engineering, in which engineers deliberately introduce failures into their systems to make services more resilient and to foster a proactive engineering culture. Spearheaded by Senior Engineering Manager Stevenson Jean-Pierre (SJP), the practice has evolved from automated failure testing to a more intentional approach that targets specific failure scenarios and yields actionable insights into system vulnerabilities. The initiative, which has expanded beyond Fridays, is designed to deepen engineers' understanding of complex digital infrastructure and to uncover hidden dependencies, leading to more robust software and more reliable systems. DevOps Advocate Mandi Walls highlights Failure Friday's role in improving customer experience by fostering innovation, graceful error handling, and clear communication. The practice encourages collaboration among engineers, product managers, and business owners, and promotes a blame-free environment for open discussion and continuous learning. Adopting it comes with challenges, including stability concerns and the need for cultural shifts, but both SJP and Walls present Failure Friday as a valuable approach for organizations that want to improve system resilience and customer satisfaction. They suggest starting small and building a culture of psychological safety in order to integrate Failure Friday into an organization's DevOps processes.