
Intelligent, automatic restarts for unhealthy Kafka consumers

What's this blog post about?

In distributed systems running on Kubernetes, keeping every component healthy is a recurring problem. When microservices consume data from Apache Kafka topics, liveness checks can verify that consumers are actually processing messages. A naive approach is a simple Kafka connectivity check, but in systems with multiple partitions and replicas this may not be enough, because a consumer can remain connected while making no progress. A better check focuses on message ingestion by comparing, for each partition, the current offset (the latest offset available on the partition) with the committed offset (the latest offset the consumer has committed). If the committed offset keeps changing, or is unchanged but already equal to the current offset, the consumer is healthy; if it has stopped changing while still behind the current offset, the consumer is stuck. Rebalances complicate this approach: Kafka can reassign partitions to different consumers, so a service instance that only tracks the offsets of its previously assigned partitions will report incorrect health. To solve this, use the Sarama library's functionality to observe when a rebalance happens and update the in-memory map of offsets accordingly. Overall, smart health checks help prevent cascading failures in Kubernetes systems by ensuring that microservices are actively processing messages from Apache Kafka topics.
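The offset comparison described above can be sketched in Go, the language Sarama is written in. This is a minimal illustration under stated assumptions, not the code from the original post: `OffsetChecker`, `Healthy`, and `Reset` are hypothetical names, the committed and latest offsets are assumed to be fetched elsewhere and passed in, and the sketch uses only the standard library rather than calling Sarama.

```go
package main

import "sync"

// OffsetChecker is a hypothetical helper (not part of Sarama). It remembers
// the committed offset seen for each partition at the previous health check.
type OffsetChecker struct {
	mu   sync.Mutex
	last map[int32]int64 // partition -> committed offset at the last check
}

// NewOffsetChecker returns a checker with an empty in-memory offset map.
func NewOffsetChecker() *OffsetChecker {
	return &OffsetChecker{last: make(map[int32]int64)}
}

// Healthy reports whether the consumer appears to be making progress on one
// partition. The caller supplies the committed offset and the latest offset
// available on the partition; how they are fetched is outside this sketch.
func (c *OffsetChecker) Healthy(partition int32, committed, latest int64) bool {
	c.mu.Lock()
	defer c.mu.Unlock()

	prev, seen := c.last[partition]
	c.last[partition] = committed

	if !seen {
		// First observation after startup or a rebalance: no baseline yet.
		return true
	}
	if committed != prev {
		// The committed offset moved since the last check: messages are flowing.
		return true
	}
	// The committed offset did not move. That is fine only if the consumer
	// has caught up and there is nothing new to read.
	return committed >= latest
}

// Reset forgets every tracked partition. The assumption here is that it is
// called whenever a rebalance is observed, so stale offsets for partitions
// this instance no longer owns cannot cause a false "unhealthy" result.
func (c *OffsetChecker) Reset() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.last = make(map[int32]int64)
}
```

One plausible place to call `Reset` is the `Setup` or `Cleanup` method of a Sarama `ConsumerGroupHandler`, since Sarama invokes those at the start and end of each consumer group session. A Kubernetes liveness endpoint could then call `Healthy` for every currently assigned partition and fail the probe if any call returns false.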

Company
Cloudflare

Date published
Jan. 24, 2023

Author(s)
Chris Shepherd, Andrea Medda

Word count
1737

Hacker News points
2

Language
English


By Matt Makai. 2021-2024.