Company
Date Published
Author
Aaditya Talwai, Emily Chang
Word count
1557
Language
English
Hacker News points
1

Summary

Datadog ran this exercise to test the resilience of its Elasticsearch cluster. During a game day, the team intentionally triggered failures in its services to observe how they responded in a controlled environment. They stopped individual nodes, including the leader node. Indexing and search requests returned 503s for a brief period during the leader transition, but the applications recovered with minimal impact because of their buffering strategy. The team also discovered a problem with "dangling indices": the cluster went RED after indices were deleted without the deletion being acknowledged by all nodes. Regular health checks for client nodes proved key to preventing issues. The exercise provided insight into the cluster's behavior and highlighted the importance of monitoring and maintenance.
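To make the buffering idea concrete, here is a minimal sketch of how an application might hold documents in memory while the cluster returns 503s and deliver them once it recovers. It is not Datadog's implementation: the `BufferedIndexer` class, its `send` callback, and the batch size are all hypothetical names chosen for illustration, and the transport is assumed to return an HTTP status code.

```python
from collections import deque
from itertools import islice
from typing import Callable, Deque, Dict, List


class BufferedIndexer:
    """Hypothetical helper: queue documents and retry them after a failed send."""

    def __init__(self, send: Callable[[List[Dict]], int], batch_size: int = 100):
        # `send` is an assumed transport that returns an HTTP status code.
        self._send = send
        self._batch_size = batch_size
        self._pending: Deque[Dict] = deque()

    def add(self, doc: Dict) -> None:
        """Queue a document; nothing is sent until flush() is called."""
        self._pending.append(doc)

    @property
    def pending(self) -> int:
        """Number of documents still waiting to be delivered."""
        return len(self._pending)

    def flush(self) -> int:
        """Send queued documents in batches and stop at the first failure.

        Returns the number of documents delivered by this call. A failed
        batch stays at the front of the queue, so order is preserved.
        """
        delivered = 0
        while self._pending:
            batch = list(islice(self._pending, self._batch_size))
            status = self._send(batch)
            if not 200 <= status < 300:
                # For example a 503 during a leader transition: keep the
                # batch and let the caller try again later.
                break
            for _ in batch:
                self._pending.popleft()
            delivered += len(batch)
        return delivered
```

A caller would invoke `flush()` on a timer or after each `add()`. An unbounded in-memory queue is a simplification; a real service would likely cap its size or spill to disk.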