Improving database resilience with observability and chaos testing
Blog post from New Relic
Chaos engineering builds resilient systems by deliberately introducing failures, exposing weaknesses before they cause real-world incidents. New Relic runs weekly chaos experiments in pre-production environments, with a particular focus on complex systems such as Amazon Aurora databases. These experiments validate failover processes, harden applications against database disruptions, reduce the risk of outages, and reveal stress points and capacity limits that inform performance tuning. Observability is essential throughout: New Relic's infrastructure agent, APM, and CloudWatch metrics together cover both client-side and server-side activity, which supports quick incident response and helps refine resilience strategies. Database drivers must also be configured to follow AWS's DNS TTL policies; a client that caches a resolved address for too long can keep targeting the old instance after a failover. As chaos experiments grow in scope, they allow more sophisticated simulations and further improvements in system robustness.
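To make the DNS TTL point concrete, here is a minimal Python sketch of a client that looks up the database hostname again on every connection attempt instead of reusing a cached address. It is an illustration under stated assumptions, not New Relic's implementation: `resolve_fresh`, `connect_with_failover_retry`, the `connect` callback, and the retry and backoff values are all hypothetical, and the system resolver may still apply its own caching.

```python
import socket
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def resolve_fresh(host: str, port: int) -> str:
    """Ask the system resolver for the host's current IPv4 address on every call."""
    infos = socket.getaddrinfo(host, port, socket.AF_INET, socket.SOCK_STREAM)
    return infos[0][4][0]


def connect_with_failover_retry(
    host: str,
    port: int,
    connect: Callable[[str, int], T],
    resolve: Callable[[str, int], str] = resolve_fresh,
    max_attempts: int = 5,      # hypothetical value, tune per workload
    base_delay: float = 0.5,    # hypothetical value, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry a connection, re-resolving the hostname before each attempt.

    `connect` is a caller-supplied function (for example, a thin wrapper
    around a database driver) that takes an IP address and a port.
    """
    last_error: Optional[OSError] = None
    for attempt in range(max_attempts):
        try:
            # Resolve inside the loop so a failover that repoints the DNS
            # record is picked up on the next attempt.
            address = resolve(host, port)
            return connect(address, port)
        except OSError as exc:  # covers DNS errors and refused connections
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)  # exponential backoff
    raise ConnectionError(
        f"could not connect to {host}:{port} after {max_attempts} attempts"
    ) from last_error
```

During a chaos experiment, the number of attempts and the total time spent in this loop are useful client-side signals to compare against server-side failover metrics.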