The case for Fault Injection testing in Production
Blog post from Gremlin
Organizations often begin Fault Injection testing in non-production environments to learn how their systems behave when components fail, but this approach may miss the complexity and real-world conditions of production. Non-production testing is safer because it does not affect customer traffic, yet it can produce false positives because scale, configuration, and traffic patterns differ from production. Production testing carries more risk but gives more accurate insight into system resilience and customer impact, especially in Verification testing for known failure modes. The two environments complement each other: non-production testing helps identify potential failures and prepare systems, while only production testing can validate true reliability. Ultimately, the goal is to improve system resilience by progressively introducing production testing so that systems are exercised under real-world conditions.