Apache Kafka Lag Monitoring at AppsFlyer
Blog post from Confluent
At AppsFlyer, a SaaS mobile marketing platform, visibility into distributed systems is crucial, particularly for Apache Kafka, which is integral to the company's large-scale event-driven architecture. With Kafka streaming tens of billions of events daily, AppsFlyer recognized a gap in monitoring Kafka lag, which indicates how far behind a consumer is in processing data. The team had previously relied on a cumbersome Clojure service and sought a more scalable, automated solution. After evaluating options such as Kafka Lag Exporter and Remora, they chose LinkedIn's Burrow for its flexibility, modular design, and ability to monitor consumer lag effectively. With Burrow integrated, AppsFlyer can monitor clusters, visualize lag metrics, and develop time-based metrics that anticipate potential data loss when unconsumed messages age past the retention window. The team aims to improve the system further by creating smart alerts and decoupling Burrow stacks to handle growing cluster traffic efficiently.
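
To make the notions of offset lag and time-based lag concrete, here is a minimal Python sketch. It is not AppsFlyer's or Burrow's implementation: the names `PartitionOffsets`, `time_lag_seconds`, and `at_risk_of_data_loss`, the constant-produce-rate approximation, and the 80% safety threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class PartitionOffsets:
    """Offsets for one partition (hypothetical container, not a Burrow type)."""
    log_end_offset: int    # latest offset written to the partition
    committed_offset: int  # last offset committed by the consumer group


def offset_lag(p: PartitionOffsets) -> int:
    """Number of messages the consumer still has to process on this partition."""
    return max(p.log_end_offset - p.committed_offset, 0)


def total_lag(partitions: Iterable[PartitionOffsets]) -> int:
    """Sum of per-partition lag for a consumer group on a topic."""
    return sum(offset_lag(p) for p in partitions)


def time_lag_seconds(lag: int, produce_rate_per_sec: float) -> float:
    """Rough time-based lag, assuming messages arrive at a constant rate."""
    if lag == 0:
        return 0.0
    if produce_rate_per_sec <= 0:
        return float("inf")  # lag exists but no rate estimate is available
    return lag / produce_rate_per_sec


def at_risk_of_data_loss(time_lag: float, retention_seconds: float,
                         safety_fraction: float = 0.8) -> bool:
    """Flag a consumer whose time lag approaches the topic retention period.

    safety_fraction is an assumed threshold: alert once the lag reaches
    80% of retention, before the broker starts deleting unread messages.
    """
    return time_lag >= retention_seconds * safety_fraction


partitions = [PartitionOffsets(1000, 900), PartitionOffsets(500, 500)]
lag = total_lag(partitions)
seconds_behind = time_lag_seconds(lag, produce_rate_per_sec=50.0)
print(lag, seconds_behind, at_risk_of_data_loss(seconds_behind, 3600))
```

Expressing lag in seconds rather than in messages is what lets it be compared directly with a topic's retention period, which is the kind of time-based metric described above.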