How to Handle API Downtime at Scale
Blog post from Zapier
Internet downtime is inevitable, even for major platforms, and it is especially disruptive for interconnected systems like Zapier, which integrates with over 750 APIs. To limit the impact of API downtime on its own services, Zapier developed a load-shedding strategy that identifies APIs experiencing delays and stops polling them. The approach monitors API response times, uses percentile-based thresholds to predict downtime, and automatically generates alerts and halts polling when necessary. It began as a simple script and evolved into Prometheus, a more sophisticated tool integrated into Zapier's platform. Prometheus automates downtime detection, generates alerts, and updates Zapier's public status page, which keeps users informed. The solution shows how statistical methods can streamline engineering processes and make API reliability manageable across many interconnected APIs.
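
The percentile-based load shedding described above can be illustrated with a minimal Python sketch. This is not Zapier's implementation: the class name `ApiLatencyMonitor`, the rolling window size, the 95th percentile, and the 10-second threshold are all hypothetical choices made for illustration. The sketch keeps recent response times for one API, computes a high percentile over them, and pauses polling (recording an alert) when that percentile crosses the threshold.

```python
from collections import deque
from statistics import quantiles


class ApiLatencyMonitor:
    """Hypothetical monitor: tracks recent response times for one API."""

    def __init__(self, name, window=100, percentile=95,
                 threshold_seconds=10.0, min_samples=20):
        self.name = name
        self.samples = deque(maxlen=window)   # rolling window of response times
        self.percentile = percentile          # assumed value, not from the post
        self.threshold_seconds = threshold_seconds
        self.min_samples = min_samples        # avoid judging on too little data
        self.polling_enabled = True
        self.alerts = []                      # stand-in for a real alerting channel

    def record(self, response_seconds):
        """Store one observed response time, in seconds."""
        self.samples.append(response_seconds)

    def current_percentile(self):
        """Return the configured percentile, or None if data is too sparse."""
        if len(self.samples) < self.min_samples:
            return None
        # quantiles(n=100) returns 99 cut points; index p-1 is the p-th percentile.
        cuts = quantiles(self.samples, n=100, method="inclusive")
        return cuts[self.percentile - 1]

    def evaluate(self):
        """Shed load (stop polling) while the percentile exceeds the threshold."""
        value = self.current_percentile()
        if value is None:
            return self.polling_enabled
        healthy = value <= self.threshold_seconds
        if healthy != self.polling_enabled:
            state = "resumed" if healthy else "paused"
            self.alerts.append(f"{self.name}: polling {state} (p{self.percentile}={value:.1f}s)")
        self.polling_enabled = healthy
        return self.polling_enabled


if __name__ == "__main__":
    monitor = ApiLatencyMonitor("example-api")
    for _ in range(20):
        monitor.record(0.2)       # normal responses
    print(monitor.evaluate())
    for _ in range(20):
        monitor.record(30.0)      # the API starts timing out
    print(monitor.evaluate())
    print(monitor.alerts)
```

A percentile is used here rather than a mean so that a handful of slow responses do not trigger load shedding, while a sustained slowdown does. In a real system the `alerts` list would be replaced by whatever alerting and status-page integration is in use.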