Company
Date Published
Author
Summer Devlin
Word count
1620
Language
English
Hacker News points
None

Summary

The Elasticsearch (ES) cluster is a critical component of real-time observability at Plaid, but spikes in log volume strain its scalability. The team designed safety rails to prevent resource-intensive queries from crashing ES and implemented a solution to protect the cluster from surges in log ingestion volume. They explored two classes of solutions: dropping logs as necessary so that volume never exceeds a threshold (N messages/second), or queuing excess logs so that a spike in log volume does not cause accidental data loss. The queuing solution was chosen: each service gets a budget, only N messages/second are emitted to ES, and any remaining messages are queued for later processing. This approach has improved the reliability of the ELK stack, allowing engineers to move faster and deploy with more certainty, but it also increases the number of Kinesis shards and Logstash pods, which requires additional monitoring and alerting rules.
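
To make the per-service budget concrete, here is a minimal in-memory sketch of the idea, not Plaid's implementation. The summary says the real pipeline relies on Kinesis shards and Logstash pods; this example replaces both with a plain Python queue per service. The class name `ServiceLogBudget`, its methods, and the once-per-second `tick` are all hypothetical names chosen for illustration.

```python
from collections import defaultdict, deque


class ServiceLogBudget:
    """Emit at most `budget_per_second` messages per service on each tick.

    Hypothetical illustration: messages over budget stay queued for a later
    tick instead of being dropped.
    """

    def __init__(self, budget_per_second):
        self.budget = budget_per_second
        # One FIFO backlog per service, so a noisy service only delays itself.
        self.backlog = defaultdict(deque)

    def enqueue(self, service, message):
        """Accept a log message; it is never dropped, only delayed."""
        self.backlog[service].append(message)

    def tick(self):
        """Call once per second; returns the (service, message) pairs to emit."""
        emitted = []
        for service, queue in self.backlog.items():
            # Drain up to the budget, oldest messages first.
            for _ in range(min(self.budget, len(queue))):
                emitted.append((service, queue.popleft()))
        return emitted

    def backlog_size(self, service):
        """Number of messages still waiting for this service."""
        return len(self.backlog[service])
```

With a budget of 2, a service that logs 5 messages in one second has them emitted over three ticks (2, 2, then 1), while a quiet service sharing the same limiter is unaffected. The trade-off matches the one in the summary: nothing is lost during a spike, but the logs of a bursting service reach ES late, so backlog depth is worth monitoring.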