Monitoring Kafka performance metrics

Company

Datadog

Date Published

April 6, 2016

Author

Evan Mouzakitis, David M. Lentz

Word count

5423

Language

English

Hacker News points

URL

www.datadoghq.com/blog/monitoring-kafka-performance-metrics

Summary

Kafka is a distributed, partitioned, replicated log service developed by LinkedIn and open sourced in 2011. It is designed to handle the real-time data feeds of large companies. Kafka differs from other message queue systems such as RabbitMQ, ActiveMQ, and Redis's Pub/Sub in several ways: it is a replicated log service, it uses a custom binary TCP-based protocol, it is very fast even on small clusters, and it offers strong ordering semantics and durability guarantees. Many organizations use Kafka, including LinkedIn, Pinterest, Twitter, and Datadog. The latest release covered by the article is version 2.4.1. A Kafka deployment consists of brokers that act as intermediaries between producer applications and consumer applications. Producers push messages to brokers in batches, while consumers pull messages from the log at their own rate. Messages are organized into topics, each of which stores related messages; topics are divided into partitions, and partitions are assigned to brokers. The greater the number of partitions, the more concurrent consumers a topic can support. Kafka's replication feature provides high availability by persisting each partition on multiple brokers. ZooKeeper is used in Kafka deployments to maintain information about Kafka's brokers and topics, to apply quotas that govern traffic, and to store information about replicas. Monitoring ZooKeeper is therefore key to maintaining a healthy Kafka cluster. Key ZooKeeper metrics include outstanding requests, average latency, number of alive connections, pending syncs, bytes sent/received, usable memory, swap usage, and disk latency. A properly functioning Kafka cluster can handle significant amounts of data, but monitoring its health and performance is crucial to keeping the applications that depend on it reliable.
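
Several of the ZooKeeper metrics named above (outstanding requests, average latency, alive connections, pending syncs) are reported by ZooKeeper's `mntr` four-letter command. The following is a minimal sketch of one way to collect them, not the method used in the article. The helper names `parse_mntr` and `fetch_mntr`, the default host and port, and the `SAMPLE` text are illustrative assumptions. On ZooKeeper 3.5 and later, `mntr` typically has to be allowed through the `4lw.commands.whitelist` setting, and `zk_pending_syncs` is reported only by the leader.

```python
import socket

# Keys from ZooKeeper's `mntr` output that correspond to metrics in the summary.
WATCHED_KEYS = {
    "zk_outstanding_requests",
    "zk_avg_latency",
    "zk_num_alive_connections",
    "zk_pending_syncs",
}


def parse_mntr(text):
    """Parse tab-separated `key<TAB>value` lines, keeping only watched keys."""
    metrics = {}
    for line in text.splitlines():
        key, sep, value = line.partition("\t")
        if sep and key in WATCHED_KEYS:
            metrics[key] = float(value)
    return metrics


def fetch_mntr(host="localhost", port=2181, timeout=5.0):
    """Send `mntr` to a ZooKeeper server and return its raw text reply.

    Host and port are assumptions; adjust them for your deployment.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"mntr")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8")


# Illustrative sample in the `mntr` line format (not captured from a real server).
SAMPLE = (
    "zk_version\t3.5.6\n"
    "zk_avg_latency\t2\n"
    "zk_outstanding_requests\t0\n"
    "zk_num_alive_connections\t12\n"
    "zk_pending_syncs\t1\n"
)

if __name__ == "__main__":
    for name, value in sorted(parse_mntr(SAMPLE).items()):
        print(name, value)
```

To poll a live server, pass the result of `fetch_mntr()` to `parse_mntr()`. The remaining metrics in the list (bytes sent/received, usable memory, swap usage, disk latency) are host-level measurements and would come from the operating system rather than from `mntr`.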