Safe schema updates - Resilient vs robust IT systems
Blog post from Octopus Deploy
In this article, Alex Yates explores the concept of resilience versus robustness in IT systems, emphasizing the importance of building systems that are prepared for failures rather than focusing solely on preventing them. Through the lens of complex systems theory, Yates distinguishes complex systems from complicated ones, noting that complex systems, which include human elements, are inherently unpredictable. Drawing on insights from fields such as DevOps and Safety-II thinking, the article argues for a shift from traditional Safety-I approaches, which aim to prevent failures, toward approaches that focus on ensuring systems succeed despite failures. Yates uses Netflix as an example of a resilient IT system, one that withstands failures by designing for them, incorporating redundancy, and rehearsing failure scenarios so that core operations keep running. He advocates frequent, smaller updates and effective containment of failures through strategies like loose coupling and the Strangler Pattern, which can improve database reliability and reduce risk. The article sets the stage for future discussions of continuous integration and other practices for improving IT system safety and reliability.
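
To make the containment idea concrete, here is a minimal sketch of a Strangler-style facade in Python. It is not taken from the article: the names `StranglerFacade`, `legacy_get_customer`, and `new_get_customer` are hypothetical, and the fallback to the legacy path on error is an added assumption used to illustrate failure containment, not a required part of the pattern.

```python
from typing import Callable, Dict

Handler = Callable[[int], str]


def legacy_get_customer(customer_id: int) -> str:
    """Hypothetical stand-in for the old code path (legacy schema)."""
    return f"legacy:customer:{customer_id}"


def new_get_customer(customer_id: int) -> str:
    """Hypothetical stand-in for the new code path (new schema)."""
    if customer_id < 0:
        # Simulated defect in the new path, used to show containment.
        raise ValueError("negative ids are not supported by the new path")
    return f"new:customer:{customer_id}"


class StranglerFacade:
    """Routes each operation to its new handler once migrated, else to legacy."""

    def __init__(self, legacy: Dict[str, Handler]) -> None:
        self._legacy = dict(legacy)
        self._migrated: Dict[str, Handler] = {}

    def migrate(self, operation: str, handler: Handler) -> None:
        """Move one operation to the new implementation (a small, frequent change)."""
        self._migrated[operation] = handler

    def call(self, operation: str, arg: int) -> str:
        handler = self._migrated.get(operation)
        if handler is not None:
            try:
                return handler(arg)
            except Exception:
                # Assumption for this sketch: contain the failure by falling
                # back to the legacy path instead of failing the caller.
                pass
        return self._legacy[operation](arg)


facade = StranglerFacade({"get_customer": legacy_get_customer})
before = facade.call("get_customer", 7)     # served by the legacy path
facade.migrate("get_customer", new_get_customer)
after = facade.call("get_customer", 7)      # served by the new path
fallback = facade.call("get_customer", -1)  # new path fails, legacy answers
```

Because callers depend only on the facade, each operation can be moved on its own, and a fault in one migrated operation stays contained rather than taking down the whole system.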