How we scaled our logging stack by creating per-team budgets

Company

Plaid

Date Published

March 19, 2021

Author

Summer Devlin

Word count

1620

Language

English

Hacker News points

None

URL

plaid.com/blog/how-we-scaled-our-logging-stack-by-creating-per-team-budgets

Summary

The Elasticsearch (ES) cluster is a critical component of real-time observability at Plaid, but spikes in log volume strain its scalability. The team designed safety rails to prevent resource-intensive queries from crashing ES and implemented a solution to protect the cluster from surges in log ingestion volume. They explored two classes of solutions: dropping logs as necessary so that volume never exceeds a threshold (N messages/second), or queueing excess logs to prevent accidental data loss during spikes in log volume. The team chose the queueing solution: each service gets a budget, only N messages/second are emitted to ES, and any remaining messages are queued for later processing. This approach has improved the reliability of the ELK stack, allowing engineers to move faster and deploy with more certainty, but it also increases the number of Kinesis shards and Logstash pods, which require additional monitoring and alerting rules.
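
The per-service budget with queueing can be illustrated with a small in-memory sketch. This is not Plaid's implementation: the class name `BudgetedForwarder`, the `tick` method, the service names, and the budget values are all hypothetical, and a plain Python deque stands in for the real queue in front of ES. Each call to `tick` simulates one second, emitting at most N messages per service and leaving the remainder queued instead of dropping it.

```python
from collections import deque


class BudgetedForwarder:
    """Hypothetical sketch: emit at most N messages per service per simulated second."""

    def __init__(self, budgets, default_budget=100):
        self.budgets = budgets                # assumed mapping: service name -> N messages/second
        self.default_budget = default_budget  # assumed fallback for services with no budget set
        self.backlog = {}                     # service name -> deque of messages awaiting emission

    def enqueue(self, service, message):
        """Accept a log message; nothing is dropped, it only waits its turn."""
        self.backlog.setdefault(service, deque()).append(message)

    def tick(self):
        """Simulate one second: emit up to the budget per service, keep the rest queued."""
        emitted = []
        for service, queue in self.backlog.items():
            budget = self.budgets.get(service, self.default_budget)
            for _ in range(min(budget, len(queue))):
                emitted.append((service, queue.popleft()))
        return emitted

    def queued(self, service):
        """Number of messages still waiting for this service."""
        return len(self.backlog.get(service, ()))


# A noisy service bursts five messages against a budget of two per second.
forwarder = BudgetedForwarder({"noisy-service": 2, "quiet-service": 5})
for i in range(5):
    forwarder.enqueue("noisy-service", f"log {i}")
forwarder.enqueue("quiet-service", "hello")

first_second = forwarder.tick()   # two noisy messages plus the quiet one
second_second = forwarder.tick()  # the next two noisy messages
```

In this sketch the burst from one service delays only that service's own logs; the quiet service is emitted within its budget in the same second, and the noisy service's backlog drains over later ticks.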