How we cut ClickHouse latency from 12s to 2s
Blog post from Mux
ClickHouse is known for handling large-scale data ingestion and aggregation efficiently, but a deployment ingesting real-time data through Kafka hit a performance bottleneck even though CPU utilization was only about 60%. The underlying issue was a trade-off between latency and throughput: throughput looked healthy, but ingestion latency was not being measured at first, so data was showing up late on real-time dashboards without any metric flagging it. Experimentation showed that the bottleneck was the inefficient parsing of the Protobuf single-message format. By switching to a batched format and adjusting the Kafka flush interval, the team reduced ingestion latency from 12 seconds to 2 to 6 seconds while keeping throughput high and CPU usage manageable. The main lesson is to monitor both latency and throughput, since tracking only one leaves a blind spot; this applies to anyone using ClickHouse's Kafka Table Engine.
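The point about measuring latency alongside throughput can be made concrete with a small sketch. The post summary does not say how Mux measured ingestion latency, so everything here is an assumption for illustration: the function names `ingestion_latencies` and `summarize` are hypothetical, and each row is assumed to carry two timestamps, one for when the event was produced to Kafka and one for when it became queryable in ClickHouse.

```python
from datetime import datetime, timedelta
from statistics import quantiles


def ingestion_latencies(rows):
    """Return per-row ingestion latency in seconds.

    `rows` is an iterable of (produced_at, visible_at) datetime pairs.
    Both timestamps are assumed to be available; this is not from the post.
    """
    return [(visible - produced).total_seconds() for produced, visible in rows]


def summarize(latencies, window_seconds):
    """Report throughput and latency together for one observation window."""
    # quantiles(..., n=100) returns 99 cut points; index 49 is the median
    # and index 98 is the 99th percentile. Requires at least two samples.
    cuts = quantiles(latencies, n=100, method="inclusive")
    return {
        "rows_per_second": len(latencies) / window_seconds,
        "p50_s": cuts[49],
        "p99_s": cuts[98],
        "max_s": max(latencies),
    }


if __name__ == "__main__":
    base = datetime(2024, 1, 1, 12, 0, 0)
    # Hypothetical sample: four rows that became visible 2, 4, 6 and 12 s late.
    sample = [(base, base + timedelta(seconds=s)) for s in (2, 4, 6, 12)]
    print(summarize(ingestion_latencies(sample), window_seconds=2))
```

Reporting the two numbers side by side is the design choice that matters: a rows-per-second figure alone would look the same whether rows arrive 2 seconds or 12 seconds late.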