Postgres High Availability with CDC
Blog post from PlanetScale
PostgreSQL's replication design makes high availability (HA) with Change Data Capture (CDC) tightly coupled and less flexible to operate, primarily because CDC relies on logical replication slots. In a typical HA setup, a primary PostgreSQL server is paired with semi-synchronous standbys, and a CDC client reads from a logical replication slot on the primary. Failover is where this becomes difficult: the logical slot must be synchronized across nodes, and a standby only becomes eligible to carry the slot after the CDC client has advanced the slot while that standby is receiving the slot's metadata. If the client is offline or lagging, new or restarted standbys remain ineligible, so a failover either has to wait or breaks the CDC stream.

In contrast, MySQL's approach, built on GTIDs and binlogs, is more flexible because failover is not tied to slot state. Replicas re-emit replicated transactions into their own binlogs, which maintains GTID continuity, so a CDC connector can resume from the last committed GTID after a failover, regardless of how often it polls.

This difference highlights the brittleness of PostgreSQL HA with logical consumers: slot progress has to be coordinated across the cluster and depends on subscriber behavior, which can delay switchovers or disrupt CDC.
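The slot eligibility rule described above can be sketched as a small toy model. This is not PostgreSQL code or any real API: the `Standby` and `Cluster` classes, their fields, and the integer positions are all hypothetical names chosen for illustration. The model only captures the one rule from the text, that a standby becomes eligible to carry the slot when the CDC client advances the slot while that standby is receiving metadata.

```python
from dataclasses import dataclass, field


@dataclass
class Standby:
    name: str
    connected: bool = True    # currently receiving slot metadata from the primary
    slot_ready: bool = False  # eligible to carry the logical slot after promotion


@dataclass
class Cluster:
    slot_position: int = 0    # position last confirmed by the CDC client
    standbys: list = field(default_factory=list)

    def add_standby(self, name: str) -> Standby:
        # A new or restarted standby always starts out ineligible.
        standby = Standby(name)
        self.standbys.append(standby)
        return standby

    def cdc_client_advances(self, new_position: int) -> None:
        # Only real progress counts; an idle or lagging client changes nothing.
        if new_position <= self.slot_position:
            return
        self.slot_position = new_position
        for standby in self.standbys:
            if standby.connected:
                standby.slot_ready = True

    def failover_candidates(self) -> list:
        # Standbys that could be promoted without breaking the CDC stream.
        return [s.name for s in self.standbys if s.slot_ready]


cluster = Cluster()
cluster.add_standby("standby-a")
cluster.cdc_client_advances(100)      # client makes progress, standby-a qualifies
cluster.add_standby("standby-b")      # joins while the client is idle
print(cluster.failover_candidates())  # ['standby-a']

cluster.cdc_client_advances(100)      # no progress, so standby-b stays ineligible
print(cluster.failover_candidates())  # ['standby-a']

cluster.cdc_client_advances(200)      # client resumes, standby-b now qualifies
print(cluster.failover_candidates())  # ['standby-a', 'standby-b']
```

In this sketch, the set of safe failover targets changes only inside `cdc_client_advances`, which is the coupling the text describes: cluster readiness depends on what the subscriber does, not on the cluster alone.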