Learning Review for our 22 November API and Origin outage

Post Details

Company

Netlify

Date Published

Dec. 1, 2016

Author

Chris McCraw

Word Count

894

Language

English

Hacker News Points

-

Source URL

www.netlify.com/blog/2016/12/01/learning-review-for-our-22-november-api-and-origin-outage

Summary

On November 22nd, Netlify's database went down after it filled its disk, a result of rapid growth in file deployments. The team had noticed the upward trend in database size days earlier and had begun preparing for potential issues, but did not migrate the data to a larger partition in time. The CDN edge nodes continued to serve content, which limited the impact of the outage. The team has since analyzed the incident, identified its causes, and implemented measures to prevent similar outages: revamping their monitoring system, deploying a new status page with incident history, and modifying their replication method to reduce space usage. They have also redesigned their database handling practices to ensure that a live master and slave are both available during potentially impactful operations.
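
The disk-growth trend described in the summary can be turned into a rough "days until full" estimate for a monitoring alert. The sketch below is illustrative only and is not taken from the post: the function name `days_until_full`, the sample format, and the linear-growth assumption are all hypothetical, and a real deployment burst may grow much faster than a straight-line fit predicts.

```python
from typing import Optional, Sequence, Tuple


def days_until_full(
    samples: Sequence[Tuple[float, float]],
    capacity_bytes: float,
) -> Optional[float]:
    """Estimate days remaining before a disk fills up.

    samples: (day, used_bytes) pairs, oldest first (hypothetical format).
    Fits a least-squares line through the samples and extrapolates from
    the most recent sample. Returns None when there are too few samples
    or when usage is flat or shrinking.
    """
    n = len(samples)
    if n < 2:
        return None

    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        # All samples share one timestamp, so no growth rate can be fitted.
        return None

    # Least-squares slope: bytes of growth per day.
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None

    _, last_used = samples[-1]
    # Clamp at zero in case the disk is already at or over capacity.
    return max(0.0, (capacity_bytes - last_used) / slope)
```

In practice the `used_bytes` values could come from periodic readings such as `shutil.disk_usage(path).used`, and an alert could fire when the estimate drops below the time needed to migrate to a larger partition.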