Code Orange: Fail Small — our resilience plan following recent incidents

Blog post from Cloudflare

Post Details
Company: Cloudflare
Author: Dane Knecht
Word Count: 1,981
Language: English
Summary

Cloudflare experienced two significant network outages in November and December 2025, each affecting a large portion of its services. In response, the company launched an initiative called "Code Orange: Fail Small" to improve network resilience and prevent similar incidents. The first outage was caused by an automatic update to the Bot Management classifier; the second resulted from a configuration change to a security tool, made to address a React vulnerability. Both incidents exposed a gap between how configuration changes and software updates were deployed, so Cloudflare is adopting a more controlled rollout process for configuration, similar to its Health Mediated Deployment (HMD) system for software. The plan is organized into workstreams that require controlled rollouts for configuration changes, review and improve failure modes, and revise internal emergency response procedures so that the necessary tools remain quickly accessible during incidents. The company plans to make iterative improvements across its network infrastructure and has committed to completing significant updates by the end of Q1, while continuing ongoing work to address circular dependencies and update security protocols.
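
To make the idea of a controlled, health-gated rollout concrete, here is a minimal sketch in Python. It is not Cloudflare's HMD implementation, which the summary does not describe in detail. The function name `staged_rollout`, its callback parameters, and the stage fractions are all hypothetical and chosen only for illustration. The sketch applies a change to a growing share of targets, checks health after each stage, and reverts everything if any target looks unhealthy.

```python
from typing import Callable, Sequence


def staged_rollout(
    targets: Sequence[str],
    apply_change: Callable[[str], None],
    revert_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    stage_fractions: Sequence[float] = (0.01, 0.1, 0.5, 1.0),  # assumed stages
) -> bool:
    """Apply a change in growing stages; revert everything on a failed health check.

    Returns True if the change reached every target, False if it was rolled back.
    """
    applied: list[str] = []
    for fraction in stage_fractions:
        # Number of targets that should carry the change once this stage ends.
        goal = max(1, int(len(targets) * fraction))
        for target in targets[len(applied):goal]:
            apply_change(target)
            applied.append(target)
        # Health gate: stop here and revert before the change spreads further.
        if not all(is_healthy(t) for t in applied):
            for t in reversed(applied):
                revert_change(t)
            return False
    return True
```

With 100 targets and the default stages, a change that fails its first health check is applied to one target and then reverted, which is the "fail small" behavior the summary describes. A real system would also need soak time between stages and richer health signals, both of which this sketch omits.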