Reliability lessons from the 2025 Microsoft Azure Front Door outage
Blog post from Gremlin
On October 29, 2025, a significant outage in Microsoft Azure Front Door took down global services including Microsoft 365, Outlook, and Xbox Live, and disrupted companies such as Costco and Starbucks. The issue stemmed from a misconfiguration in Azure's data plane and content delivery network. Although the initial response was rapid, full recovery took seven hours. The incident underscores the importance of redundancy and failover systems, and of rigorously testing dependencies with tools like Gremlin, which can simulate an outage so teams can verify how their systems respond. It also shows that customers hold businesses accountable for service disruptions, so companies need systems robust enough to withstand such incidents. By mapping and testing their dependencies and understanding their reliability risks, organizations can limit the impact and maintain service continuity even when a cloud provider has problems.
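The failover idea can be illustrated with a minimal sketch. It is not taken from the original post: the names `call_with_failover`, `primary_cdn`, `secondary_origin`, and `DependencyUnavailable` are hypothetical, and the outage is simulated in-process by raising an exception rather than by injecting a fault into a real dependency.

```python
from typing import Callable, Sequence


class DependencyUnavailable(Exception):
    """Raised when a (simulated) dependency cannot serve the request."""


def call_with_failover(providers: Sequence[Callable[[], str]]) -> str:
    """Try each provider in order and return the first successful response."""
    errors = []
    for provider in providers:
        try:
            return provider()
        except DependencyUnavailable as exc:
            # Record the failure and fall through to the next provider.
            errors.append(str(exc))
    raise DependencyUnavailable(f"all {len(providers)} providers failed: {errors}")


def primary_cdn() -> str:
    # Simulated outage: stands in for an edge or CDN dependency that is down.
    raise DependencyUnavailable("primary CDN unreachable")


def secondary_origin() -> str:
    # Hypothetical fallback path, for example serving directly from the origin.
    return "200 OK from secondary origin"


# With the primary "down", the request should still succeed via the fallback.
result = call_with_failover([primary_cdn, secondary_origin])
print(result)
```

In a real test the outage would be injected against the actual dependency (for example, by blocking traffic to it) and the check would be whether user-facing requests still succeed. The assertion at the heart of such a test has the same shape as here: the primary fails, and the response still arrives from the fallback.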