The principles of extreme fault tolerance
Blog post from PlanetScale
PlanetScale pursues extreme fault tolerance through a combination of principles, architecture, and processes meant to keep the service reliable and limit disruption when components fail. Three principles guide the design: isolation, redundancy, and static stability. Together they prevent cascading failures and let the system keep operating on its last known good state. The architecture is split into a control plane and a data plane, built with redundancy and minimal dependencies to improve reliability. Processes such as automatic failover, query buffering, and synchronous replication allow seamless transitions during failures, while progressive delivery limits the impact of changes. Replication relies on MySQL semi-sync replication and Postgres synchronous commits, and feature flags support gradual rollouts. The system is designed to tolerate several failure modes, including non-query-path failures, cloud provider failures, and zonal or regional disruptions, with the goal of leaving customer queries unaffected.
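To make the static stability principle concrete, here is a minimal sketch of a component that keeps serving its last known good configuration when a refresh fails. It is an illustration only, not PlanetScale's implementation: the class `StaticallyStableConfig`, the exception `ControlPlaneUnavailable`, the `fetch` callback, and the `primary` key are all hypothetical names invented for this example.

```python
from typing import Callable, Dict, List, Optional, Tuple


class ControlPlaneUnavailable(Exception):
    """Hypothetical error raised when the control plane cannot be reached."""


class StaticallyStableConfig:
    """Keeps the last known good config and serves it while refreshes fail."""

    def __init__(self, fetch: Callable[[], Dict[str, str]]) -> None:
        self._fetch = fetch  # hypothetical call into the control plane
        self._last_good: Optional[Dict[str, str]] = None

    def refresh(self) -> bool:
        """Try to pull fresh config; on failure, keep the previous copy."""
        try:
            fresh = self._fetch()
        except ControlPlaneUnavailable:
            return False  # data path keeps running on the old state
        self._last_good = dict(fresh)
        return True

    def get(self) -> Dict[str, str]:
        """Return the last known good config, fresh or not."""
        if self._last_good is None:
            raise RuntimeError("no config has ever been loaded")
        return self._last_good


def demo() -> List[Tuple[bool, str]]:
    """Simulate three refreshes where the second one hits an outage."""
    responses = iter([
        {"primary": "db-a"},
        ControlPlaneUnavailable(),
        {"primary": "db-b"},
    ])

    def fetch() -> Dict[str, str]:
        item = next(responses)
        if isinstance(item, Exception):
            raise item
        return item

    cfg = StaticallyStableConfig(fetch)
    seen = []
    for _ in range(3):
        ok = cfg.refresh()
        seen.append((ok, cfg.get()["primary"]))
    return seen
```

In this sketch, the failed second refresh does not interrupt reads: `get()` still returns the configuration loaded by the first refresh, and the new value is picked up once the simulated control plane recovers. A real system would also need to decide how long stale state stays acceptable, which this example leaves out.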