Testing part 5: Longevity testing
Blog post from ScyllaDB
Longevity testing is a key part of ScyllaDB's test suite. It evaluates the stability and reliability of ScyllaDB clusters over extended periods, combining sustained stress workloads with fault injection to surface problems that only appear in long-running deployments. Unlike a unit test, it is an integration test that exercises the entire system as deployed on AWS: a cluster of nodes is loaded by tools such as cassandra-stress while a component called Nemesis orchestrates disruptions. These disruptions mimic real-world failures and include stopping and restarting instances, corrupting data, and decommissioning nodes. A test passes when the cluster keeps functioning despite the induced failures, which is taken as evidence of the system's resilience. The tests have already revealed critical issues, such as nodes running out of space and failures during node decommissioning. Future plans for the longevity test suite include support for multiple cloud providers and multi-datacenter testing. The source code is available on GitHub for community involvement and further development.
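
The overall shape of such a test can be sketched as a loop that repeatedly injects a disruption and then checks that the cluster can still serve requests. The Python below is only an illustration that runs against an in-memory toy cluster. It is not ScyllaDB's test code and does not use the real Nemesis API. Every name in it (Node, Cluster, stop_and_restart, decommission, run_longevity) is hypothetical, and the quorum rule is a simplifying assumption.

```python
import random
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    up: bool = True


@dataclass
class Cluster:
    """Toy stand-in for a real cluster (hypothetical, in-memory only)."""
    nodes: list
    replication_factor: int = 3

    def live_nodes(self):
        return [n for n in self.nodes if n.up]

    def can_serve(self):
        # Simplifying assumption: requests succeed while a quorum
        # of replication_factor nodes is alive.
        return len(self.live_nodes()) >= self.replication_factor // 2 + 1


def stop_and_restart(cluster, rng):
    """Take one live node down, check availability, bring it back."""
    node = rng.choice(cluster.live_nodes())
    node.up = False
    served = cluster.can_serve()
    node.up = True
    return served


def decommission(cluster, rng):
    """Permanently remove a node, but never shrink below replication_factor."""
    if len(cluster.nodes) > cluster.replication_factor:
        cluster.nodes.remove(rng.choice(cluster.nodes))
    return cluster.can_serve()


# Hypothetical disruption catalogue; a real suite would have many more.
DISRUPTIONS = [stop_and_restart, decommission]


def run_longevity(cluster, rounds, seed=0):
    """Inject one random disruption per round and record whether the
    cluster stayed available. Stops at the first failure."""
    rng = random.Random(seed)  # seeded so a failing run can be replayed
    log = []
    for i in range(rounds):
        disruption = rng.choice(DISRUPTIONS)
        ok = disruption(cluster, rng)
        log.append((i, disruption.__name__, ok))
        if not ok:
            break
    return log
```

Calling run_longevity(Cluster([Node(f"n{i}") for i in range(5)]), rounds=10) returns one log entry per round. In a real longevity run the availability check would be replaced by the results of an actual stress workload running for hours or days, and the disruptions would act on real instances.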