How we stopped memory intensive queries from crashing ElasticSearch

Company

Plaid

Date Published

June 21, 2019

Author

Angela Zhang

Word count

1656

Language

English

Hacker News points

None

URL

plaid.com/blog/how-we-stopped-memory-intensive-queries-from-crashing-elasticsearch

Summary

We investigated repeated ElasticSearch outages at Plaid in which memory-intensive queries crashed data nodes and brought down the cluster. The root cause was user-written queries that aggregated over a very large number of buckets: the individual counters for those buckets took up too much memory on each data node. To address this, we configured the request memory circuit breaker to cap the memory any single query can use, and we limited the number of buckets ElasticSearch will use for an aggregation. We worked with AWS support to apply these changes to the cluster settings, which should prevent similar outages in the future.
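As a rough illustration of the two safeguards described in the summary, the sketch below builds a request body for the cluster settings API (`PUT /_cluster/settings`). It assumes the relevant settings are `indices.breaker.request.limit` (the request circuit breaker) and `search.max_buckets` (the aggregation bucket cap). The helper name `build_cluster_settings` and the values `"40%"` and `10000` are placeholders chosen for the example, not the values Plaid used, and on a managed AWS cluster the change may have to go through AWS support rather than a direct API call.

```python
import json


def build_cluster_settings(request_breaker_limit: str, max_buckets: int) -> dict:
    """Build a body for PUT /_cluster/settings (hypothetical helper).

    request_breaker_limit: share of JVM heap a single request may use, e.g. "40%".
    max_buckets: maximum number of aggregation buckets allowed per search.
    """
    if max_buckets <= 0:
        raise ValueError("max_buckets must be a positive integer")
    return {
        # "persistent" settings survive a full cluster restart.
        "persistent": {
            "indices.breaker.request.limit": request_breaker_limit,
            "search.max_buckets": max_buckets,
        }
    }


if __name__ == "__main__":
    # Placeholder values for illustration only.
    body = build_cluster_settings("40%", 10000)
    print(json.dumps(body, indent=2))
```

With limits like these in place, a query that would exceed either one is rejected with an error instead of exhausting the heap on a data node.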