Status page
Blog post from Replicate
Replicate experienced a significant outage on May 11, 2023, affecting both its website and API, primarily due to issues with its PostgreSQL database, which is central to its platform. The outage was triggered by a newly implemented asynchronous update feature that inadvertently caused simultaneous INSERT queries for the same prediction ID, leading to increased query latencies and eventual exhaustion of the database connection pool. Despite initial assumptions that the previous day's database resizing might have been the cause, the real issue stemmed from the asynchronous updates creating a dangerous query pattern. As a result, the replicate.com website faced intermittent failures, although customers using webhooks to receive updates were less affected. The company responded by disabling the problematic feature, which restored normal operations, and is now focused on redesigning the feature to avoid similar issues in the future. This experience has provided valuable insights into the system's constraints, prompting a review of potential database hotspots and a determination to improve resilience against database disruptions.