Breaking to Learn: Chaos Engineering Explained

Company

New Relic

Date Published

Jan. 10, 2019

Author

Fredric Paul, Editor in Chief

Word count

1773

Language

English

Hacker News points

None

URL

newrelic.com/blog/best-practices/chaos-engineering-explained

Summary

Chaos engineering is the discipline of proactively causing failures in a system, often in random places and at random intervals, to build confidence in its ability to withstand turbulent conditions in production. Netflix was the first to adopt the approach: after migrating from an on-premises stack to a distributed, cloud-based architecture, it introduced tools such as Chaos Monkey to test the resilience of its systems. Despite the name, chaos engineering is not about causing chaos indiscriminately. Faults are injected carefully, in a controlled way, to test how systems respond to failure. Each experiment should minimize the blast radius so that business operations are not significantly affected. That requires meticulous planning, the right team, and simulated scenarios that could make systems unavailable or degrade their performance. Chaos engineering can be thought of as a formal method for generating new knowledge about complex systems and uncovering systemic weaknesses. By practicing it, companies can improve the reliability of their modern architectures and prepare for outages before they occur.
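To make the idea of carefully injected failure concrete, here is a minimal sketch in Python. It is not taken from the article and is not how Chaos Monkey works. The names `InjectedFault`, `with_fault_injection`, and `run_experiment` are hypothetical, and the sketch assumes a single in-process dependency call. It caps the share of calls that fail, then measures how many requests are still served.

```python
import random
from typing import Callable, Optional


class InjectedFault(Exception):
    """Stands in for a real outage during an experiment (hypothetical name)."""


def with_fault_injection(call: Callable[[], str], failure_rate: float,
                         rng: random.Random) -> Callable[[], str]:
    """Wrap `call` so that roughly `failure_rate` of invocations fail.

    Keeping `failure_rate` small is one simple way to limit how much
    traffic an experiment can affect.
    """
    def wrapped() -> str:
        if rng.random() < failure_rate:
            raise InjectedFault("simulated dependency failure")
        return call()
    return wrapped


def run_experiment(call: Callable[[], str],
                   fallback: Callable[[], Optional[str]],
                   requests: int) -> float:
    """Return the fraction of requests that still received an answer."""
    served = 0
    for _ in range(requests):
        try:
            call()
            served += 1
        except InjectedFault:
            # The system under test should degrade gracefully.
            if fallback() is not None:
                served += 1
    return served / requests
```

For example, wrapping a call with `failure_rate=0.1` and a seeded `random.Random` makes the experiment repeatable. Comparing the served fraction with and without a working fallback shows whether the system tolerates the injected failure.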