Won't Get Fooled Again

Post Details

Company

Gremlin

Date Published

Oct. 2, 2017

Author

Gremlin

Word Count

1,058

Language

English

Hacker News Points

-

Source URL

www.gremlin.com/blog/wont-get-fooled-again

Summary

On an uneventful Friday afternoon, the Gremlin team faced an unexpected failure: their service began returning 5xx errors because of a disk space issue caused by improper log rotation. The team quickly identified and resolved the problem, and the incident reinforced their commitment to never suffering the same failure twice. It led to the creation of the Disk Gremlin, a tool designed to simulate this real-world failure mode, alongside proactive measures such as publishing custom disk utilization metrics to CloudWatch and scheduling automated attacks to test system resilience. This approach follows the practice of "Continuous Chaos": regularly testing systems against known failure modes to confirm they stay reliable, much as regression tests guard against old bugs in software development. Gremlin's philosophy emphasizes building testable, reliable systems by anticipating and simulating chaos in their operations.
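As an illustration of the "custom disk utilization metrics" measure, here is a minimal sketch of how a host might report disk usage to CloudWatch. It is not Gremlin's implementation: the namespace `Example/System`, the metric name `DiskUtilization`, the dimension names, and the helper functions are all assumptions made for this example. It uses the `put_metric_data` call from boto3's CloudWatch client, and boto3 is imported only when the script is run directly.

```python
import shutil
from datetime import datetime, timezone


def disk_utilization_percent(path="/"):
    """Return the used share of the filesystem containing `path`, in percent."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def build_metric(path, percent, host):
    """Build one CloudWatch metric datum (metric and dimension names are hypothetical)."""
    return {
        "MetricName": "DiskUtilization",
        "Dimensions": [
            {"Name": "Host", "Value": host},
            {"Name": "Path", "Value": path},
        ],
        "Timestamp": datetime.now(timezone.utc),
        "Value": percent,
        "Unit": "Percent",
    }


def publish(client, namespace, metric):
    """Send a single datum through a CloudWatch client's put_metric_data."""
    client.put_metric_data(Namespace=namespace, MetricData=[metric])


if __name__ == "__main__":
    import socket

    import boto3  # third-party; needs AWS credentials and a configured region

    path = "/"
    metric = build_metric(path, disk_utilization_percent(path), socket.gethostname())
    # "Example/System" is a placeholder namespace, not one taken from the post.
    publish(boto3.client("cloudwatch"), "Example/System", metric)
```

Run on a schedule (from cron, for instance), a script like this gives a CloudWatch alarm something to watch, so a filling disk is noticed before the service starts returning errors. A disk-filling attack can then confirm that the alarm actually fires.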