
Our billing pipeline was suddenly slow. The culprit was a hidden bottleneck in ClickHouse

Blog post from Cloudflare

Post Details
Company: Cloudflare
Date Published:
Author: James Morrison and Christian Endres
Word Count: 2,167
Language: English
Hacker News Points: -
Summary

Cloudflare ran into significant ClickHouse performance problems while migrating to a new partitioning scheme designed to allow per-namespace data retention. The migration slowed the daily aggregation jobs that billing depends on. The cause was an unexpected bottleneck in query planning: increased lock contention and inefficient part filtering, both made worse by the sheer number of data parts. The fix was a series of optimizations: switching to a shared lock to reduce contention, deferring vector copies that were not always needed, and using a binary search to speed up part filtering. These changes significantly reduced query durations and resolved the immediate crisis. The experience also showed how complex and risky large-scale data architecture decisions can be, and it leaves open whether further architectural changes will be needed.
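To make the part-filtering idea concrete, here is a minimal Python sketch. It is not ClickHouse's code (ClickHouse is written in C++), and the names `Part`, `PartIndex`, `parts_for`, and `filter_linear` are hypothetical. It assumes the parts can be kept sorted by partition key, which the summary does not state. Under that assumption, a binary search finds the matching range in O(log n), and only the matching slice is copied instead of the whole list. The shared-lock change is not shown.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Part:
    partition: str  # hypothetical partition key, e.g. a namespace ID
    name: str       # hypothetical part name


def filter_linear(parts, partition):
    """Baseline: copy every part up front, then scan all of them. O(n)."""
    snapshot = list(parts)  # eager copy of the full list
    return [p for p in snapshot if p.partition == partition]


class PartIndex:
    """Parts kept sorted by partition so lookups can use binary search."""

    def __init__(self, parts):
        # sorted() is stable, so parts within a partition keep their order.
        self._parts = sorted(parts, key=lambda p: p.partition)
        # Parallel list of keys for bisect to search.
        self._keys = [p.partition for p in self._parts]

    def parts_for(self, partition):
        """Return only the parts in `partition`. O(log n + k) for k matches."""
        lo = bisect_left(self._keys, partition)
        hi = bisect_right(self._keys, partition, lo)
        # The copy is deferred until the range is known, so only k parts
        # are copied rather than all n.
        return self._parts[lo:hi]


parts = [
    Part("ns-b", "b_1"),
    Part("ns-a", "a_1"),
    Part("ns-b", "b_2"),
    Part("ns-c", "c_1"),
]
index = PartIndex(parts)
matching = index.parts_for("ns-b")
```

The trade-off in this sketch is that the sorted order must be maintained as parts are added and removed. That cost is repaid when lookups are frequent and the number of parts is large, which is the situation the summary describes.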