Our billing pipeline was suddenly slow. The culprit was a hidden bottleneck in ClickHouse
Blog post from Cloudflare
Cloudflare ran into significant ClickHouse performance problems after migrating to a new partitioning scheme designed to allow per-namespace data retention. The migration slowed down the daily aggregation jobs that billing depends on. The cause was an unexpected bottleneck in query planning: increased lock contention and inefficient part filtering, both exacerbated by the sheer number of data parts. To resolve it, a series of optimizations was implemented: switching to a shared lock to reduce contention, deferring unnecessary vector copying, and using binary search to speed up part filtering. These changes significantly improved query durations and resolved the immediate crisis. The experience nonetheless highlighted the complexities and potential pitfalls of large-scale data architecture decisions, and it leaves open the question of whether further architectural changes will be necessary.
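
To make the part-filtering idea concrete, here is a minimal Python sketch of why binary search helps when the list of parts is very large. It is not ClickHouse's actual code (ClickHouse is written in C++), and the names `Part`, `PartIndex`, `range_for`, `parts_for`, and `linear_filter` are hypothetical, chosen only for illustration. The sketch assumes parts can be kept sorted by partition key, so all parts of one partition form a contiguous run that two binary searches can locate.

```python
from bisect import bisect_left, bisect_right
from typing import List, NamedTuple, Tuple


class Part(NamedTuple):
    """Hypothetical stand-in for a data part; not a ClickHouse type."""
    partition: str  # assumed partition key, e.g. a namespace
    name: str       # assumed part name


def linear_filter(parts: List[Part], partition: str) -> List[Part]:
    """Baseline: inspect every part, O(n) work per query."""
    return [p for p in parts if p.partition == partition]


class PartIndex:
    """Keeps parts sorted by partition so lookups can use binary search."""

    def __init__(self, parts: List[Part]) -> None:
        # Sort once up front; the keys list is built once and reused.
        self._parts = sorted(parts)
        self._keys = [p.partition for p in self._parts]

    def range_for(self, partition: str) -> Tuple[int, int]:
        """Return the half-open index range [lo, hi) of matching parts.

        Two binary searches, O(log n), and nothing is copied yet.
        """
        lo = bisect_left(self._keys, partition)
        hi = bisect_right(self._keys, partition)
        return lo, hi

    def parts_for(self, partition: str) -> List[Part]:
        """Copy out only the matching run, and only when it is needed."""
        lo, hi = self.range_for(partition)
        return self._parts[lo:hi]


if __name__ == "__main__":
    parts = [
        Part("ns-b", "p1"),
        Part("ns-a", "p0"),
        Part("ns-b", "p2"),
        Part("ns-c", "p3"),
    ]
    index = PartIndex(parts)
    print(index.range_for("ns-b"))
    print(index.parts_for("ns-b"))
```

Returning an index range from `range_for` and copying only in `parts_for` also loosely mirrors the "defer unnecessary copying" idea from the summary: the full list is never duplicated just to answer one lookup. The shared-lock change is not shown here, since the Python standard library has no reader-writer lock.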