No record left behind: How WarpStream can withstand cloud provider regional outages
Blog post from WarpStream
WarpStream has introduced Multi Region Clusters, which keep a cluster available even if an entire cloud provider region fails. A standard cluster is backed by a single control plane region and carries a 99.99% availability guarantee. A Multi Region Cluster instead uses multiple control plane regions and a replicated data plane, so it can survive the loss of a whole region without losing any ingested data and with no more than a few seconds of downtime. For durability, writes go to a quorum of three object storage buckets. Leader election decides which control plane region handles traffic, which reduces latency and write conflicts. The design covers both hard and soft failures: leadership moves to another region with no manual intervention and no data loss, giving a Recovery Point Objective (RPO) of zero. Multi Region Clusters come with a 99.999% uptime SLA, making WarpStream suitable for mission-critical workloads that must ride out regional outages.
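To make the quorum write idea concrete, here is a minimal sketch in Python. It is not WarpStream's implementation: the `InMemoryBucket` class, the `quorum_write` function, and the region names are hypothetical, and the sketch assumes that "quorum" means a simple majority (2 of 3 buckets). It only illustrates why a write can still be durable when one region's bucket is unreachable.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


class InMemoryBucket:
    """Hypothetical stand-in for an object storage bucket in one region."""

    def __init__(self, region, healthy=True):
        self.region = region
        self.healthy = healthy
        self.objects = {}

    def put(self, key, data):
        # Simulate a regional outage by refusing the write.
        if not self.healthy:
            raise ConnectionError(f"bucket in {self.region} is unreachable")
        self.objects[key] = data


def quorum_write(buckets, key, data):
    """Attempt the write on every bucket in parallel.

    Succeeds if a majority of buckets stored the object and returns the
    regions that acknowledged. Raises RuntimeError otherwise.
    """
    needed = len(buckets) // 2 + 1  # assumption: quorum = simple majority
    acked = []
    with ThreadPoolExecutor(max_workers=len(buckets)) as pool:
        futures = {pool.submit(b.put, key, data): b for b in buckets}
        # For simplicity this waits for every attempt to finish; a real
        # system would likely return as soon as the quorum is reached.
        for future in as_completed(futures):
            if future.exception() is None:
                acked.append(futures[future].region)
    if len(acked) < needed:
        raise RuntimeError(f"only {len(acked)} of {needed} required acks")
    return sorted(acked)


# One of the three regions is down, yet the write still reaches a quorum.
buckets = [
    InMemoryBucket("region-a"),
    InMemoryBucket("region-b", healthy=False),
    InMemoryBucket("region-c"),
]
acked = quorum_write(buckets, "batch-0001", b"records")
print(acked)  # ['region-a', 'region-c']
```

In this sketch, losing a second bucket makes `quorum_write` raise instead of acknowledging the write. That is the property that supports an RPO of zero: data is acknowledged only once it is stored in more than one region.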