Learning Review for our Feb 2 Origin Outage

Post Details

Company

Netlify

Date Published

Feb. 3, 2017

Author

Matt Biilmann

Word Count

1,371

Language

English

Hacker News Points

4

Source URL

www.netlify.com/blog/2017/02/03/learning-review-for-our-feb-2-origin-outage

Summary

The company recently experienced a serious incident caused by human error, resulting in the overwriting of metadata for 5.2% of their deploys, leading to downtime for affected sites and API failures. The team responded quickly, focusing on getting paying customers back online first, while also working on restoring deploy metadata from backups and fixing API serialization errors. They implemented several measures to prevent similar incidents, including peer signoff for custom write queries, database snapshots before running write queries, and enforced auditing and reviews of database interface commands. To improve recovery, they plan to log more deploy metadata, update their internal documentation, and trigger rebuilds through a user-friendly interface.