Navigating Kafka’s auto-commit: avoiding duplication and data loss
Blog post from New Relic
New Relic's use of Apache Kafka has surfaced challenges with Kafka's consumer auto-commit configuration, which can lead to data loss or duplication when a consumer service fails, for example through an out-of-memory (OOM) kill. With auto-commit enabled, the consumer commits offsets automatically at a fixed interval, which simplifies offset management but opens a window in which a failure can either re-deliver already-processed messages (duplication) or commit offsets for messages that were never fully processed (loss). Manual offset management closes that window by committing only after processing completes, at the cost of added complexity. Kafka version 0.11 introduced transactional support for exactly-once processing, but it imposes additional latency. Streaming frameworks such as Flink or Kafka Streams also offer exactly-once semantics, although some teams accept minor data inconsistencies rather than pay the engineering cost of eliminating them entirely. New Relic emphasizes building stable services and evaluating other streaming systems to mitigate these risks, while acknowledging the trade-off between engineering cost and data reliability.
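To make the auto-commit trade-off concrete, here is a minimal Java consumer sketch that disables `enable.auto.commit` and commits offsets only after a batch has been processed. The broker address, group id, and topic name are placeholders rather than details from the post, and the post does not prescribe this exact pattern.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ManualCommitConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "example-group");           // hypothetical consumer group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Disable auto-commit so offsets only advance after processing succeeds.
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("example-topic")); // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // application-specific work; must finish before the commit below
                }
                // Commit only after the whole batch is processed. A crash before this
                // line causes reprocessing (duplicates), but never silent data loss.
                consumer.commitSync();
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
    }
}
```

With this pattern a crash can still reprocess the last uncommitted batch, so downstream handling should be idempotent, but offsets are never committed ahead of work that has not finished.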
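For the streaming-framework route the post mentions, a Kafka Streams application can opt into Kafka's transactional exactly-once guarantee with a single configuration setting. The sketch below is an assumed minimal example (the application id, broker address, and topic names are hypothetical), not code from New Relic.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class ExactlyOnceCopy {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "exactly-once-copy");  // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed broker address
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        // Enable Kafka's transactional exactly-once processing (requires 0.11+ brokers).
        // Offset commits and output writes become atomic, at the cost of extra latency.
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);

        StreamsBuilder builder = new StreamsBuilder();
        // Copy records from an input topic to an output topic (topic names are placeholders).
        builder.<String, String>stream("input-topic").to("output-topic");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

The atomic commit of offsets and output records is what delivers exactly-once semantics here, and it is also the source of the additional latency the post notes.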