Company
Date Published
Author
Chris McCraw
Word count
894
Language
English
Hacker News points
None

Summary

A service's database went down on November 22nd after it filled its disk, the result of rapid growth in file deployments. The team had noticed the database size trending upward days earlier and had begun preparing, but did not migrate the data to a larger partition in time. The CDN edge nodes continued to serve content, which limited the impact of the outage. The team has since analyzed the incident, identified its causes, and taken steps to prevent a recurrence: revamping their monitoring system, deploying a new status page with incident history, and changing their replication method to reduce space usage. They have also redesigned their database handling practices so that a live master and slave are both available during potentially impactful operations.
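
The summary does not describe how the revamped monitoring works. As a rough illustration of the kind of check that turns an upward trend in database size into an actionable deadline, here is a minimal sketch that fits a straight line to daily disk-usage samples and estimates the days left before the partition fills. The function name `days_until_full`, the one-sample-per-day spacing, and the linear-growth assumption are all hypothetical and not taken from the original post.

```python
from typing import Optional, Sequence


def days_until_full(daily_usage_gb: Sequence[float], capacity_gb: float) -> Optional[float]:
    """Estimate days until the disk is full from one usage sample per day.

    Hypothetical helper: assumes roughly linear growth and evenly spaced
    samples. Returns None when there are too few samples or usage is not
    growing.
    """
    n = len(daily_usage_gb)
    if n < 2:
        return None  # a trend needs at least two points

    # Ordinary least-squares slope, with the sample index as the day number.
    mean_x = (n - 1) / 2
    mean_y = sum(daily_usage_gb) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_usage_gb))
    var = sum((x - mean_x) ** 2 for x in range(n))
    slope_gb_per_day = cov / var

    if slope_gb_per_day <= 0:
        return None  # flat or shrinking usage: no projected fill date

    # Project forward from the most recent sample.
    remaining_gb = max(capacity_gb - daily_usage_gb[-1], 0.0)
    return remaining_gb / slope_gb_per_day
```

An alert could compare the returned estimate against the lead time a migration to a larger partition needs, and fire while there is still room to act. Real growth is rarely linear, so a production check would likely also look at the most recent slope and at absolute free-space thresholds.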