Root Cause Analysis for Render's Extended Service Disruption on 3/26/24
Blog post from Render
On March 26, 2024, at 16:07 UTC, Render suffered a significant service disruption when a faulty code change inadvertently restarted all workloads on its platform. Static sites and services that did not depend on PostgreSQL, Redis, or attached disks recovered quickly. Services with attached disks took much longer to recover because of system-level throttling and rate limits. Render's engineers identified and disabled the faulty code, then raised rate limits and prioritized paid services to speed up recovery; full functionality was restored by 20:00 UTC. The incident exposed inconsistencies in Render's testing infrastructure and gaps in its communication. In response, Render is taking steps to improve platform reliability, strengthen its testing processes, and deliver more timely incident communications, and it says it will improve its recovery protocols to prevent similar incidents.
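The interaction between rate limits and prioritization during recovery can be illustrated with a small sketch. This is not Render's implementation: the names `Restart`, `plan_recovery`, and `ops_per_tick` are hypothetical, and the model assumes a fixed number of restart operations is allowed per time interval, with paid services placed ahead of free ones.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Restart:
    """A pending restart (hypothetical model, not Render's data structure)."""
    priority: int                    # 0 = paid, 1 = free; lower value runs first
    seq: int                         # arrival order, used to break ties
    name: str = field(compare=False)


def plan_recovery(services, ops_per_tick):
    """Group pending restarts into batches of at most ops_per_tick each.

    services: iterable of (name, is_paid) pairs.
    ops_per_tick: assumed rate limit, in restart operations per interval.
    """
    if ops_per_tick < 1:
        raise ValueError("ops_per_tick must be at least 1")
    heap = [Restart(0 if paid else 1, i, name)
            for i, (name, paid) in enumerate(services)]
    heapq.heapify(heap)
    batches = []
    while heap:
        size = min(ops_per_tick, len(heap))
        # Pop the highest-priority restarts that fit within this interval's budget.
        batches.append([heapq.heappop(heap).name for _ in range(size)])
    return batches


services = [("free-a", False), ("paid-a", True), ("free-b", False), ("paid-b", True)]
slow = plan_recovery(services, ops_per_tick=1)
fast = plan_recovery(services, ops_per_tick=2)
```

In this toy model, raising the limit from 1 to 2 operations per interval halves the number of intervals needed (4 batches versus 2), and paid services are restarted first in both cases.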