Postmortem for Aurora Postgres Migration, November 23, 2022
Blog post from RevenueCat
On November 23, 2022, a migration from AWS Aurora Postgres 10.x to 14.x caused significant performance degradation and severely affected backend systems. The root cause was inefficient query planning: ANALYZE had not been run on the largest tables after the migration, leaving the query planner without the statistics it relies on. The migration, planned because support for Aurora Postgres 10.x was ending, used a new approach based on database replication to maintain data consistency. Despite extensive preparation, the transition caused a temporary failure of backend systems and degraded the user experience, especially for new purchases, which had trouble unlocking entitlements. Because of the system's complexity, failures cascaded and the root cause was hard to identify; the incident took several hours to resolve. Remediation included adjusting query plans, manually running ANALYZE on key tables, and temporarily suspending incoming Apple webhook requests to aid recovery. The post-migration review highlighted the importance of timing, communication, and thorough testing, particularly of write operations, to prevent similar incidents.
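
The manual ANALYZE step described above can be sketched as follows. This is a minimal illustration in Python, not RevenueCat's actual tooling. It assumes table metadata has already been read from a statistics view such as PostgreSQL's pg_stat_user_tables (which exposes schemaname, relname, n_live_tup, last_analyze, and last_autoanalyze), and it only builds the ANALYZE statements, largest table first, for tables that have never been analyzed. The TableStats type, the analyze_statements helper, and the table names in the usage lines are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class TableStats:
    """One row of table metadata (hypothetical shape, modeled on pg_stat_user_tables)."""
    schema: str
    name: str
    live_rows: int                          # estimated live rows (n_live_tup)
    last_analyze: Optional[datetime]        # None if never analyzed manually
    last_autoanalyze: Optional[datetime]    # None if never analyzed by autovacuum


def quote_ident(identifier: str) -> str:
    """Quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + identifier.replace('"', '""') + '"'


def analyze_statements(tables: List[TableStats]) -> List[str]:
    """Return ANALYZE statements for never-analyzed tables, largest first."""
    pending = [
        t for t in tables
        if t.last_analyze is None and t.last_autoanalyze is None
    ]
    # Largest tables first: they are the ones where a bad plan hurts most.
    pending.sort(key=lambda t: t.live_rows, reverse=True)
    return [
        f"ANALYZE {quote_ident(t.schema)}.{quote_ident(t.name)};"
        for t in pending
    ]


if __name__ == "__main__":
    # Hypothetical tables; real input would come from the new cluster.
    sample = [
        TableStats("public", "small_table", 10, None, None),
        TableStats("public", "big_table", 1_000_000, None, None),
        TableStats("public", "already_done", 5_000_000, datetime(2022, 11, 23), None),
    ]
    for statement in analyze_statements(sample):
        print(statement)
```

Running the generated statements before cutting traffic over to the new cluster, rather than during the incident, is the design point: the check is cheap, and it makes missing statistics visible as a list of tables instead of as slow queries in production.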