Code Orange: Fail Small — our resilience plan following recent incidents
Blog post from Cloudflare
Cloudflare suffered two significant network outages in November and December 2025. Both affected a large portion of its services and prompted the company to launch "Code Orange: Fail Small," a comprehensive initiative to improve network resilience and prevent similar incidents. The first outage was caused by an automatic update to the Bot Management classifier; the second resulted from a configuration change to a security tool, made to address a React vulnerability. Both incidents exposed weaknesses in how configuration changes were deployed compared with software updates, so Cloudflare is adopting a more controlled rollout process for configuration, similar to the Health Mediated Deployment (HMD) system it uses for software. The plan is organized into workstreams that require controlled rollouts for configuration changes, review and improve failure modes, and revise internal emergency response procedures so that risks are mitigated and the necessary tools can be reached quickly during an incident. Cloudflare intends to make iterative improvements across its network infrastructure and has committed to completing significant updates by the end of Q1, while continuing work to address circular dependencies and update security protocols.
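To make the idea of a controlled, health-gated configuration rollout concrete, here is a minimal Python sketch. It is not Cloudflare's HMD implementation, which this summary does not describe in detail. The names `staged_rollout`, `RolloutResult`, `stage_percents`, and `is_healthy` are hypothetical, and the fleet is modeled as a plain dictionary. The sketch assumes that a rollout proceeds in growing stages, that a health check gates each stage, and that any failure rolls back every server touched so far.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class RolloutResult:
    """Outcome of a rollout (hypothetical type, for illustration only)."""
    completed: bool   # True if every stage passed its health check
    max_exposed: int  # most servers that ever ran the new config


def staged_rollout(
    versions: Dict[str, str],
    new_version: str,
    stage_percents: List[int],
    is_healthy: Callable[[str, str], bool],
) -> RolloutResult:
    """Apply new_version in cumulative stages, rolling back on the first failure.

    versions maps server name -> config version and is updated in place.
    stage_percents lists cumulative fleet percentages, e.g. [5, 25, 100].
    is_healthy(server, version) is an assumed health probe.
    """
    servers = sorted(versions)
    previous = dict(versions)  # snapshot used for rollback
    exposed = 0
    for percent in stage_percents:
        # Cumulative target for this stage: at least one server, never
        # more than the fleet size. Integer math avoids float rounding.
        target = min(len(servers), max(1, len(servers) * percent // 100))
        for server in servers[exposed:target]:
            versions[server] = new_version
        exposed = max(exposed, target)
        # Health gate: check every server updated so far before widening.
        if not all(is_healthy(s, new_version) for s in servers[:exposed]):
            for s in servers[:exposed]:
                versions[s] = previous[s]  # restore the snapshot
            return RolloutResult(completed=False, max_exposed=exposed)
    return RolloutResult(completed=True, max_exposed=exposed)


# Usage: a config that is unhealthy everywhere is caught in the first stage.
fleet = {f"edge-{i:02d}": "v1" for i in range(20)}
result = staged_rollout(fleet, "v2-bad", [5, 25, 100], lambda s, v: v != "v2-bad")
print(result, set(fleet.values()))
```

In this sketch a bad change reaches only the first stage before it is reverted, which is the "fail small" property the plan aims for. A real system would also need soak time between stages, richer health signals, and an emergency path for urgent changes; none of that is modeled here.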