Postmortem: Clerk System Outage (March 10, 2026)
Blog post from Clerk
Clerk recently experienced a significant service outage after a live migration of its Google Cloud SQL virtual machine failed, causing increased disk latency and lock contention. Clerk tried to mitigate the issue by moving reads to database replicas, but the incident still escalated into a complete service outage, during which many session token requests returned 429 responses rather than the expected 500s. The root cause was a failure in the database provider's live migration process, which Clerk had believed to be reliable after extensive collaboration with Google on similar past issues. Clerk has since asked Google to pin its database to prevent further live migrations, and is exploring alternatives such as traditional replica promotion for database maintenance. Acknowledging the critical role its service plays in customers' infrastructure, Clerk has shifted its engineering focus toward reliability in order to regain customer trust.