
April 14 outage: what happened, and what’s next

Blog post from Webflow

Post Details
Company: Webflow
Author: Allan Leinwand
Word Count: 859
Language: English
Summary

Webflow suffered a significant service disruption when a CMS database cluster went offline. The cause was an undocumented capacity limit in the cloud provider's database engine, and usage against that limit was not visible in Webflow's metrics. The engine had been silently reserving logical space; once the cluster exceeded its 128 TiB allocation cap, it entered a crash loop. Access to key Webflow services was affected, though cached sites remained available.

The Webflow team responded quickly, deploying a fix to remove the affected cluster. Service was restored to 96% of customers by 9:21 am PT, and the incident was fully resolved by 9:01 pm PT with no data loss. Working closely with the cloud provider, the team doubled the storage allocation and added new monitoring and alerting to prevent a recurrence.

Webflow is also working on broader reliability improvements, including changing how the application handles database failures and upgrading database engine versions across all clusters, to raise storage limits and prevent cascading failures. The company says it is committed to the platform's reliability and will continue to post updates on its status page.
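The monitoring described above amounts to comparing the space the engine has reserved against the allocation cap and alerting well before the cap is reached. The sketch below is only an illustration of that idea, not Webflow's implementation: the `ClusterStorage` type, the `headroom_alert` function, and both threshold ratios are hypothetical, and it assumes the provider exposes the reserved logical space as a readable metric.

```python
from dataclasses import dataclass

TIB = 1024 ** 4  # bytes in one tebibyte

# Hypothetical thresholds; the post does not say what values Webflow alerts on.
WARN_RATIO = 0.70
CRITICAL_RATIO = 0.85


@dataclass(frozen=True)
class ClusterStorage:
    name: str
    allocated_bytes: int  # logical space the engine has reserved, not just live data
    cap_bytes: int        # hard allocation cap for the cluster


def headroom_alert(cluster: ClusterStorage) -> str:
    """Return "ok", "warn", or "critical" based on reserved space versus the cap."""
    if cluster.cap_bytes <= 0:
        raise ValueError("cap_bytes must be positive")
    ratio = cluster.allocated_bytes / cluster.cap_bytes
    if ratio >= CRITICAL_RATIO:
        return "critical"
    if ratio >= WARN_RATIO:
        return "warn"
    return "ok"
```

With these assumed thresholds, a cluster that has reserved 120 TiB of a 128 TiB cap is classified as critical, while the same reservation against a doubled 256 TiB cap is classified as ok. The key design point is measuring reserved (allocated) space rather than live data size, since the summary attributes the outage to space the engine reserved silently.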