Company
Date Published
Author
Casey Huang
Word count
730
Language
English
Hacker News points
None

Summary

Pulumi Cloud experienced a 24-minute outage on October 6, 2023, when a database migration error left a significant portion of requests unprocessed. The migration modified foreign keys on a table, but pre-production testing did not accurately simulate production traffic load, and the migration caused a table lock that starved the database of connections. The incident showed that even low-traffic tables must be accounted for in scaling considerations: the lock on a less busy table still caused significant disruption. Following a postmortem, Pulumi is improving its software development lifecycle by enforcing safe SQL constraints, raising the fidelity of pre-production testing, and building new tools for testing database performance. Pulumi also plans to split its service into separate failure domains so that core workloads keep functioning during similar incidents. Pulumi emphasizes its commitment to transparency and operational excellence, apologizes to users for the disruption, and pledges to learn and improve from this experience.
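
The failure mode summarized above, where a lock on a lightly used table exhausts a shared connection pool, can be illustrated with a toy model. The sketch below is not Pulumi's system or tooling: the function `simulate`, the pool size, the traffic rate, and the share of requests touching the locked table are all hypothetical values chosen for illustration. It assumes that a request hitting the locked table keeps its connection until the lock is released, while every other request borrows and returns a connection within the same tick.

```python
from dataclasses import dataclass


@dataclass
class TickResult:
    tick: int
    served: int    # requests that completed in this tick
    rejected: int  # requests that found no free connection
    stuck: int     # connections held by queries waiting on the locked table


def simulate(pool_size, ticks, lock_start, lock_end,
             requests_per_tick=10, locked_every=5):
    """Toy model of connection starvation (all parameters are hypothetical).

    Every `locked_every`-th request in a tick touches the locked table.
    While the lock is held (lock_start <= tick < lock_end), such a request
    takes a connection and keeps it until the lock is released.
    """
    stuck = 0
    results = []
    for tick in range(ticks):
        locked = lock_start <= tick < lock_end
        if not locked:
            # Lock released: waiting queries finish and return connections.
            stuck = 0
        served = rejected = 0
        for i in range(requests_per_tick):
            if stuck >= pool_size:
                rejected += 1   # pool exhausted, even for unrelated tables
            elif locked and i % locked_every == 0:
                stuck += 1      # blocks on the lock while holding a connection
            else:
                served += 1     # fast query on an unaffected table
        results.append(TickResult(tick, served, rejected, stuck))
    return results


if __name__ == "__main__":
    for result in simulate(pool_size=4, ticks=6, lock_start=1, lock_end=4):
        print(result)
```

In this model only one request in five touches the locked table, yet with a pool of four connections the pool fills within two ticks of the lock being taken, after which every request is rejected until the lock is released. That matches the summary's point that a low-traffic table can still take down unrelated traffic when all requests share one pool.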