How we cut ClickHouse latency from 12s to 2s

Post Details

Company

Mux

Date Published

Nov. 26, 2024

Author

Nidhi Kulkarni

Word Count

2,516

Language

English

Hacker News Points

-

Source URL

www.mux.com/blog/latency-and-throughput-tradeoffs-of-clickhouse-kafka-table-engine

Summary

ClickHouse is known for efficient large-scale data ingestion and aggregation, but Mux hit a performance bottleneck when ingesting real-time data through its Kafka Table Engine, even though CPU utilization was only 60%. The underlying issue was a trade-off between latency and throughput: ingestion latency was not properly measured at first, and data was delayed in reaching real-time dashboards. Through experimentation, the team traced the bottleneck to inefficient parsing of the single-message Protobuf format. By switching to a batched format and adjusting the Kafka flush interval, they reduced ingestion latency from 12 seconds to 2-6 seconds while maintaining high throughput and manageable CPU usage. The experience shows why both latency and throughput need to be monitored to avoid blind spots in performance metrics, and it offers lessons for others using ClickHouse’s Kafka Table Engine.
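To make the latency and throughput trade-off concrete, here is a minimal Python sketch of a toy model, not ClickHouse's actual consumer logic. It assumes rows only become visible at the next timer-driven flush and ignores size-based flushes, parsing cost, and network time. The function name `summarize_flushes`, the arrival pattern, and the two interval values are hypothetical and chosen only for illustration.

```python
from statistics import mean


def summarize_flushes(arrival_times_ms, flush_interval_ms):
    """Toy model: each row becomes visible only at the next timer-driven flush."""
    if flush_interval_ms <= 0:
        raise ValueError("flush_interval_ms must be positive")
    arrivals = list(arrival_times_ms)
    if not arrivals:
        raise ValueError("need at least one arrival")

    latencies = []
    batch_sizes = {}  # flush time (ms) -> number of rows written in that flush
    for t in arrivals:
        # First flush boundary strictly after the arrival time.
        flush_at = (t // flush_interval_ms + 1) * flush_interval_ms
        latencies.append(flush_at - t)
        batch_sizes[flush_at] = batch_sizes.get(flush_at, 0) + 1

    return {
        "max_latency_ms": max(latencies),
        "mean_latency_ms": mean(latencies),
        "num_flushes": len(batch_sizes),
        "rows_per_flush": len(arrivals) / len(batch_sizes),
    }


if __name__ == "__main__":
    # Hypothetical workload: one message every 100 ms for 10 seconds.
    arrivals = range(0, 10_000, 100)
    for interval in (8_000, 2_000):
        print(interval, summarize_flushes(arrivals, interval))
```

In this model a shorter flush interval lowers the worst-case wait before a row is visible, but it also produces more, smaller writes. That is why a latency metric has to be tracked alongside throughput and CPU usage rather than inferred from them.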