The system experienced roughly two hours of downtime due to a latent bug introduced during the 2021 migration of configuration to Replit's VMs, which could deadlock when acquiring read locks on certain configurations. The bug lay dormant until May 16, 2023, when a new configuration kind was introduced without a corresponding handler, causing read locks to leak.

The downtime began at 11:45 PM Pacific Time and lasted until 1:56 AM the next day, with the system recovering gradually as more capacity came online. After identifying the root cause, a deadlock arising from the write-preferring behavior of Go's sync.RWMutex, the team rolled back to the last known good build and added extra capacity to absorb the increased load. Recovery was hindered by the auto-scaler's inability to properly handle incoming requests, so the team reduced retries and scaled up instance groups to improve throughput. A new build containing a fix for the deadlock was pushed to production shortly after the incident, ensuring the issue could not be triggered again.

To prevent similar incidents, the team plans to implement faster and more reliable rollbacks, intent-based CD systems, regular disaster-recovery training exercises, staggered deployment of configuration changes, revamped healthiness and liveness logic, and improved static analysis.