Who watches the watchmen?
Blog post from PagerDuty
PagerDuty monitors its own systems in several overlapping layers, since it has to stay reliable while delivering millions of alerts each month. Arup Chakrabarti, PagerDuty's engineering manager, described the company's monitoring strategy at DevOps Days Chicago. The tooling includes New Relic for application performance management, StatsD and DataDog for custom metrics, and SumoLogic for log analysis. The company also uses external monitoring services such as Wormly and Monitis, with its own platform alongside them to consolidate alerts.

PagerDuty focuses on cluster-level metrics rather than single-host monitoring. Alerts tied to individual servers are brittle, so it funnels alerts through its monitoring system instead of raising them host by host. It also monitors its dependencies on third-party SaaS systems with a combination of manual and automated checks; its SMS delivery testing framework is one example. Chakrabarti stresses alerting on meaningful metrics that affect customers. Internal exercises such as "Failure Friday" validate those alerts by intentionally disrupting services and checking that the monitoring systems catch the failure.
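
To make the preference for cluster-level metrics over single-host monitoring concrete, here is a minimal sketch in Python. It is an illustration only, not PagerDuty's implementation: the `HostSample` shape, the function names, and the 5% threshold are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class HostSample:
    """One host's counters for a monitoring window (hypothetical shape)."""
    host: str
    requests: int
    errors: int


def cluster_error_rate(samples: Iterable[HostSample]) -> float:
    """Return the error rate across all hosts, weighted by traffic."""
    samples = list(samples)
    total_requests = sum(s.requests for s in samples)
    total_errors = sum(s.errors for s in samples)
    if total_requests == 0:
        # No traffic in the window; this sketch treats that as "no errors".
        return 0.0
    return total_errors / total_requests


def should_alert(samples: Iterable[HostSample], threshold: float = 0.05) -> bool:
    """Alert only when the cluster as a whole crosses the threshold.

    The 5% default is an assumed value for illustration.
    """
    return cluster_error_rate(samples) >= threshold


if __name__ == "__main__":
    # One low-traffic host is failing, but customers are barely affected.
    one_bad_host = [
        HostSample("web-1", requests=1000, errors=2),
        HostSample("web-2", requests=1000, errors=3),
        HostSample("web-3", requests=10, errors=10),
    ]
    print(should_alert(one_bad_host))  # False

    # Every host is degraded, so the cluster-wide rate is high.
    degraded_cluster = [
        HostSample("web-1", requests=1000, errors=200),
        HostSample("web-2", requests=1000, errors=200),
        HostSample("web-3", requests=1000, errors=200),
    ]
    print(should_alert(degraded_cluster))  # True
```

A per-host check would page on `web-3` in the first case even though the cluster-wide error rate is under 1%. Aggregating first keeps the alert tied to what customers actually experience.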