The text discusses the critical role of Apache Kafka in maintaining the availability of applications that rely on it for data ingestion and processing, emphasizing the challenges posed by broker failures and cluster outages. It highlights the importance of operational best practices, such as appropriate broker placement and replication settings, in mitigating these risks and ensuring resilience. The text also explores strategies for handling outages, including buffering, local message storage, and load shedding, while cautioning against the pitfalls of backpressure and the complexities of preserving ordering and data integrity during recovery. These strategies are particularly relevant for high-throughput applications, where the cost of an outage can be significant in terms of both lost business and regulatory penalties. Additionally, the text underscores the necessity of monitoring Kafka clusters so that issues are detected and addressed proactively, and it suggests that organizations consider their specific transaction models and the value of the data when designing reliability mechanisms.
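To make the replication and buffering ideas concrete, the following is a minimal sketch of a producer configured for durability that falls back to a local in-memory buffer when the cluster is unreachable. It uses the standard kafka-clients Java API; the topic name "orders", the queue size, and the overflow/shedding policy are illustrative assumptions, not recommendations from the text.

```java
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class ResilientProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Durability: wait for acknowledgement from all in-sync replicas and
        // enable idempotence so retries do not introduce duplicates.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");

        // Bound how long sends may block or retry before surfacing a failure
        // to the application instead of silently applying backpressure.
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, "30000");
        props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, "5000");

        // Hypothetical bounded local buffer used while the cluster is down;
        // when it fills up, records are shed rather than blocking the caller.
        BlockingQueue<ProducerRecord<String, String>> overflow = new LinkedBlockingQueue<>(10_000);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("orders", "order-1", "{\"amount\": 42}");

            producer.send(record, (metadata, exception) -> {
                if (exception != null) {
                    // Broker unavailable or delivery timed out: buffer locally,
                    // or shed the record if the buffer is already full.
                    if (!overflow.offer(record)) {
                        System.err.println("Overflow buffer full; shedding record " + record.key());
                    }
                } else {
                    System.out.printf("Stored at %s-%d offset %d%n",
                            metadata.topic(), metadata.partition(), metadata.offset());
                }
            });
            producer.flush();
        }
    }
}
```

Note that records parked in an in-memory queue are lost if the process itself dies; durable local storage (for example, writing to disk) trades that risk for the replay and reordering complexity the text warns about during recovery.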