Outage Post Mortem – Jan 23, 2014
Blog post from PagerDuty
PagerDuty experienced an outage on January 23, 2014. A slow database query tied to new mobile app functionality put a database server under high load, which delayed notifications and caused some event loss for users over an 18-minute period. The team terminated the problematic queries and rolled back the mobile app version. Because publishing an iOS app update takes longer, they also removed the backend functionality to mitigate the problem for iOS users. Since then, PagerDuty has refactored the code to significantly improve query performance and has implemented a slow query killer to catch similar issues proactively. Going forward, they plan to enhance their database query audits, perform more rigorous performance testing, optimize server configurations, and increase monitoring to prevent future incidents, while remaining committed to transparency and reliability for their customers.
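The slow query killer mentioned above is not described in detail, so the following is only a minimal sketch of how such a tool might decide what to terminate. It assumes a MySQL-style process list (the summary does not name the database), and the `RunningQuery` type, the function names, and the 30-second threshold are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class RunningQuery:
    """One row of a MySQL-style process list (hypothetical shape)."""
    query_id: int
    command: str            # e.g. "Query" or "Sleep"
    seconds_running: int
    sql: Optional[str]      # statement text, None for idle connections


def select_queries_to_kill(
    queries: Iterable[RunningQuery], max_seconds: int = 30
) -> List[int]:
    """Return ids of read-only statements running longer than max_seconds.

    Assumption: only SELECT statements are eligible, so that the sketch
    never interrupts a write partway through.
    """
    victims = []
    for q in queries:
        if q.command != "Query" or q.sql is None:
            continue  # skip idle or non-query connections
        if not q.sql.lstrip().upper().startswith("SELECT"):
            continue  # leave writes alone
        if q.seconds_running > max_seconds:
            victims.append(q.query_id)
    return victims


def kill_statements(query_ids: Iterable[int]) -> List[str]:
    """Build MySQL `KILL QUERY` statements for the given process ids."""
    return [f"KILL QUERY {qid}" for qid in query_ids]
```

In practice a loop would poll the process list on an interval, pass the rows to `select_queries_to_kill`, and execute the resulting statements. Restricting the killer to reads is one conservative design choice; a real deployment might instead use allowlists, per-user thresholds, or alerting before termination.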