How to test application resiliency by simulating the Cloudflare December 2025 outage
Blog post from Gremlin
In December 2025, Cloudflare experienced a 25-minute outage that returned HTTP 500 errors for 28% of its traffic, a reminder of why application resiliency needs to be tested. Without running reliability tests, engineering teams have a hard time proving that their systems are resilient. Gremlin's Failure Flags addresses this by running controlled simulations of such failures at the application layer, here targeting HTTP 500 error codes specifically. Because faults are injected at specific layers of the application stack, teams can see how their applications respond to an outage in a dependency. With Failure Flags, teams can recreate errors like those seen during the Cloudflare outage to assess and improve their systems' resilience. The process involves deploying the Failure Flags agent and SDK, creating experiments, and monitoring application metrics to determine whether the system handles a high-impact outage effectively. This proactive approach helps prepare applications for real-world disruptions and gives teams a data-driven way to improve reliability and reduce risk.
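The idea of injecting HTTP 500 errors at a chosen point in the application and then watching the metrics can be illustrated with a small, self-contained sketch. It does not use Gremlin's SDK or agent: `FaultFlag`, `fetch_via_cdn`, `handle_request`, and `run_experiment` are hypothetical names invented for illustration, and the retry-then-cache fallback is only one assumed way an application might respond. The 28% error rate mirrors the figure from the outage.

```python
import random
from dataclasses import dataclass


@dataclass
class FaultFlag:
    """Hypothetical stand-in for an application-level fault injection point.

    This is NOT Gremlin's Failure Flags API; it only mimics the concept.
    """
    name: str
    enabled: bool = False
    error_rate: float = 0.0  # fraction of calls that should fail

    def should_fail(self, rng: random.Random) -> bool:
        return self.enabled and rng.random() < self.error_rate


class UpstreamError(Exception):
    """Raised when the simulated dependency answers with an error status."""

    def __init__(self, status: int):
        super().__init__(f"upstream returned HTTP {status}")
        self.status = status


def fetch_via_cdn(flag: FaultFlag, rng: random.Random) -> str:
    """Simulated call to a dependency; fails with HTTP 500 when the flag fires."""
    if flag.should_fail(rng):
        raise UpstreamError(500)
    return "fresh content"


def handle_request(flag: FaultFlag, rng: random.Random, retries: int = 2):
    """Application code under test: retry, then fall back to cached content."""
    for _ in range(retries + 1):
        try:
            return 200, fetch_via_cdn(flag, rng)
        except UpstreamError:
            continue
    return 200, "cached content"  # degrade gracefully instead of surfacing a 500


def run_experiment(error_rate: float, requests: int = 1000, seed: int = 42) -> dict:
    """Run the simulated experiment and collect the metrics to monitor."""
    rng = random.Random(seed)
    flag = FaultFlag("cdn-http-500", enabled=True, error_rate=error_rate)
    results = [handle_request(flag, rng) for _ in range(requests)]
    return {
        "requests": requests,
        "errors": sum(1 for status, _ in results if status >= 500),
        "degraded": sum(1 for _, body in results if body == "cached content"),
    }


if __name__ == "__main__":
    # 0.28 mirrors the share of traffic affected during the outage.
    print(run_experiment(error_rate=0.28))
```

In this sketch the user-facing error count stays at zero because the handler falls back to cached content, while the `degraded` count shows how often the fallback was needed. In a real experiment the same questions would be answered from the application's own metrics rather than from an in-process counter.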