Home / Companies / Netlify / Blog / Post Details
Content Deep Dive

Service degradation report from September 15, 2020

Blog post from Netlify

Post Details
Company
Date Published
Author
David Calavera
Word Count
1,428
Language
English
Hacker News Points
-
Summary

This incident highlights the challenges of deploying changes in a microservices-based infrastructure, particularly when dealing with high traffic and complex interactions between services. The upgrade to the database version introduced an unexpected variable that was not thoroughly validated, leading to a cascade effect across the cluster. To mitigate such incidents in the future, Netlify is implementing measures such as improved automatic canary analysis and anomaly detection, as well as evaluating ways to capture and replay large amounts of traffic for stress testing. The company is also working on isolating microservices and reducing their reliance on a single database, aiming to prevent similar cascade effects. Despite the incident, Netlify takes customer trust seriously and apologizes for the downtime, emphasizing its commitment to providing world-class service.