Company
Date Published
Author
Tom Lianza and Joaquin Madruga
Word count
1107
Language
English
Hacker News points
None

Summary

Cloudflare experienced a significant outage of its Tenant Service API that disrupted many of its other APIs and the Cloudflare Dashboard. The trigger was a dashboard bug: a misconfigured React useEffect hook caused the Tenant Service API to be called repeatedly during a single dashboard render. A service update deployed at the same time compounded the problem, and the Tenant Service became overwhelmed and failed. Its failure impaired API request authorization, so requests returned 5xx errors. To restore service, Cloudflare increased resources for the Tenant Service, applied a temporary rate-limiting rule, and reverted the problematic changes. A well-intentioned patch degraded service further before it was rolled back. Core network services were unaffected throughout. The incident underscored the need for better release processes and observability tooling, and Cloudflare plans to prevent similar occurrences by adopting Argo Rollouts for error detection and rollback, increasing resource allocation, and improving monitoring and visibility tools.
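
The summary does not say exactly how the useEffect hook was misconfigured. One common way this class of bug arises, assumed here purely for illustration, is a dependency array that contains an object recreated on every render. React compares each dependency with its previous value using Object.is, so a fresh object always counts as changed and the effect runs again. The sketch below models that comparison in plain TypeScript without React; countEffectRuns, depsChanged, and accountId are hypothetical names, and each effect run stands in for one API call.

```typescript
// Simplified model of React's dependency check, not React itself.
// Each dependency is compared with the previous render's value via Object.is.
type Deps = ReadonlyArray<unknown>;

function depsChanged(prev: Deps | undefined, next: Deps): boolean {
  if (prev === undefined) return true; // the first render always runs the effect
  if (prev.length !== next.length) return true;
  return next.some((dep, i) => !Object.is(dep, prev[i]));
}

// Hypothetical stand-in for a component that renders several times.
// makeDeps is called once per render, like a component body.
function countEffectRuns(
  renders: number,
  makeDeps: (render: number) => Deps,
): number {
  let prev: Deps | undefined;
  let runs = 0;
  for (let i = 0; i < renders; i++) {
    const next = makeDeps(i);
    if (depsChanged(prev, next)) runs++; // each run stands for one API call
    prev = next;
  }
  return runs;
}

const accountId = "acct-123"; // hypothetical identifier

// Unstable: a fresh object literal per render never equals the previous one.
const unstableRuns = countEffectRuns(5, () => [{ accountId }]);

// Stable: the primitive string compares equal across renders.
const stableRuns = countEffectRuns(5, () => [accountId]);

console.log({ unstableRuns, stableRuns });
```

Under this assumption, depending on a primitive value (or a memoized object) instead of a freshly built object keeps the effect from re-running on every render.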