Scaling a monitoring platform to over 9,600 bank integrations

Company

Plaid

Date Published

June 28, 2018

Author

Joy Zheng & Jeeyoung Kim

Word count

2028

Language

English

Hacker News points

None

URL

plaid.com/blog/scaling-a-monitoring-platform

Summary

We built a monitoring system to support over 9,600 bank integrations at Plaid. Our core promise is providing reliable and homogeneous data from all financial institutions. We initially struggled with our legacy logging-based monitoring system because of the diversity of institutions and the need for customizability. After determining our requirements, we identified the signals to monitor, such as success/failure counts, latency spikes, and data quality degradation. The technical requirements covered scalability, latency, usability, event transport, a time series database, alerting, and visualization. We chose Prometheus and Alertmanager for their flexibility and scalability. We built a standard pipeline for services that export straightforward metrics, and a custom pipeline for services that require more complex metric generation. Our monitoring system now exports 190k+ metrics and processes 700 events per second, and 31 engineers have contributed monitoring configuration changes. We learned the importance of building the end-to-end pipeline first, making components independently usable, tailoring aggregation to narrow areas, using standard components where possible, and investing in developer education.
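
To make the idea of turning raw events into per-institution success/failure metrics more concrete, here is a minimal sketch using only the Python standard library. It is not the pipeline described in the post: the event shape, the metric name `integration_requests_total`, the label names, and the helper functions are all hypothetical, chosen for illustration. The only thing borrowed from Prometheus is its text exposition format for the rendered lines.

```python
from collections import Counter

# Hypothetical events as (institution, outcome) pairs; this shape is an
# assumption for illustration, not taken from the original post.
EVENTS = [
    ("bank_a", "success"),
    ("bank_a", "failure"),
    ("bank_b", "success"),
    ("bank_a", "success"),
]


def aggregate(events):
    """Count events per (institution, outcome) pair."""
    return Counter(events)


def failure_ratio(counts, institution):
    """Return the fraction of failed events for one institution (0.0 if none seen)."""
    total = sum(n for (inst, _), n in counts.items() if inst == institution)
    if total == 0:
        return 0.0
    return counts[(institution, "failure")] / total


def render(counts, metric="integration_requests_total"):
    """Render counts as lines in the Prometheus text exposition format."""
    lines = [f"# TYPE {metric} counter"]
    for (institution, outcome), value in sorted(counts.items()):
        lines.append(
            f'{metric}{{institution="{institution}",outcome="{outcome}"}} {value}'
        )
    return "\n".join(lines)


if __name__ == "__main__":
    counts = aggregate(EVENTS)
    print(render(counts))
    print(failure_ratio(counts, "bank_a"))
```

Labeling each count by institution and outcome keeps the aggregation narrow, so an alert rule can compare the failure ratio of a single institution against a threshold instead of relying on one global number.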