Preparing for the worst: Our core database failover test
Blog post from Vercel
To strengthen operational resilience, Vercel's engineering team conducted a full production failover of its core control-plane database from Azure West US to East US 2, with zero customer impact. The exercise moved all control-plane traffic, including API requests and deployment operations, while production CDN traffic remained unaffected. It was motivated by previous datacenter outages and aimed to verify that the architecture could maintain uptime and keep serving production traffic without interruption. Preparation included addressing issues with proprietary Cosmos DB clients and testing codepaths across multiple staging failovers before the live exercise. The failover itself was executed with minimal operational impact, and the team validated system health using targeted and catch-all alerts. The team plans to keep refining its processes and infrastructure, and credits its partnership with Azure as crucial to achieving resilience and reliability.
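As a rough illustration of what "testing codepaths" for a regional failover can involve, here is a minimal sketch of a client wrapper that tries regions in priority order and falls over to the next one when a region is unavailable. It is not Vercel's implementation and does not use the Cosmos DB SDK. The names `FailoverClient`, `RegionUnavailable`, `healthy`, `down`, and the region strings are all hypothetical stand-ins chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


class RegionUnavailable(Exception):
    """Raised by a region handler when that region cannot serve a request."""


@dataclass
class FailoverClient:
    # Regions in priority order; the first one that responds serves the request.
    regions: List[str]
    # Hypothetical per-region handlers standing in for real database calls.
    handlers: Dict[str, Callable[[str], str]]
    # Regions that failed, recorded so that alerting could inspect them.
    failed: List[str] = field(default_factory=list)

    def query(self, request: str) -> str:
        for region in self.regions:
            try:
                return self.handlers[region](request)
            except RegionUnavailable:
                if region not in self.failed:
                    self.failed.append(region)
        raise RegionUnavailable("no region could serve the request")


def healthy(region: str) -> Callable[[str], str]:
    """Build a handler that simulates a reachable region."""
    def handler(request: str) -> str:
        return f"{region}:{request}"
    return handler


def down(region: str) -> Callable[[str], str]:
    """Build a handler that simulates an unreachable region."""
    def handler(request: str) -> str:
        raise RegionUnavailable(region)
    return handler


def demo() -> str:
    # Simulate the primary region being down and the secondary taking over.
    client = FailoverClient(
        regions=["westus", "eastus2"],
        handlers={"westus": down("westus"), "eastus2": healthy("eastus2")},
    )
    return client.query("get-deployment")
```

Calling `demo()` simulates the primary region failing and the request being served from the secondary. In a staging exercise, a wrapper of this kind lets a team confirm that every codepath tolerates the primary being unreachable before attempting the same thing in production.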