Resolving a year-long ClickHouse® lock contention

Blog post from Tinybird

Post Details
Company: Tinybird
Author: Jordi Villar
Word Count: 1,704
Language: English
Summary

Tinybird engineers resolved a long-standing lock contention problem in their ClickHouse® cluster that had capped query concurrency and left CPU resources underused. Even under high demand, CPU usage stayed below 20%, and for about a year the team relied on temporary fixes. The breakthrough came when they noticed a spike in ContextLockWait events, which pointed to contention on the Context lock. They refactored the code to replace a global mutex with read-write mutexes and added a new metric to monitor the impact of the Context lock. In testing, the change dramatically increased query throughput and pushed CPU utilization to 100%. The engineers do not expect a fivefold improvement in production, because other bottlenecks may appear, but even a 1.5x gain would significantly benefit Tinybird's infrastructure and its clients. The changes shipped in the ClickHouse® 23.10 release, and the company expects better performance for its most demanding clients.
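
The idea behind the refactor described above, swapping a single exclusive mutex for a read-write mutex, can be sketched in a few lines of C++. This is a minimal illustration, not ClickHouse's actual code: the class `SharedSettings`, the helper `runConcurrentReaders`, and the `lockWaits` counter are hypothetical names chosen for the example. The counter only loosely mimics the role of a lock-wait metric such as ContextLockWait, and it is approximate because a try-lock is allowed to fail spuriously.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <map>
#include <mutex>
#include <optional>
#include <shared_mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a shared query context (not ClickHouse's class).
class SharedSettings {
public:
    // Readers take a shared lock, so concurrent lookups do not block each other.
    std::optional<std::string> get(const std::string & key) const {
        std::shared_lock<std::shared_mutex> lock(mutex_, std::try_to_lock);
        if (!lock.owns_lock()) {
            // Could not acquire immediately: record the wait, then block.
            lock_waits_.fetch_add(1, std::memory_order_relaxed);
            lock.lock();
        }
        auto it = values_.find(key);
        if (it == values_.end())
            return std::nullopt;
        return it->second;
    }

    // Writers take an exclusive lock, blocking readers and other writers.
    void set(const std::string & key, std::string value) {
        std::unique_lock<std::shared_mutex> lock(mutex_, std::try_to_lock);
        if (!lock.owns_lock()) {
            lock_waits_.fetch_add(1, std::memory_order_relaxed);
            lock.lock();
        }
        values_[key] = std::move(value);
    }

    // Approximate number of times a caller had to wait for the lock.
    std::size_t lockWaits() const {
        return lock_waits_.load(std::memory_order_relaxed);
    }

private:
    mutable std::shared_mutex mutex_;
    mutable std::atomic<std::size_t> lock_waits_{0};
    std::map<std::string, std::string> values_;
};

// Spawns `threads` reader threads, each performing `reads_per_thread` lookups,
// and returns how many lookups found the key.
std::size_t runConcurrentReaders(const SharedSettings & settings,
                                 const std::string & key,
                                 int threads, int reads_per_thread) {
    assert(threads >= 0 && reads_per_thread >= 0);
    std::atomic<std::size_t> hits{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < reads_per_thread; ++i)
                if (settings.get(key))
                    hits.fetch_add(1, std::memory_order_relaxed);
        });
    }
    for (auto & th : pool)
        th.join();
    return hits.load();
}
```

With a plain `std::mutex`, every `get` call would serialize behind every other one, which is the kind of contention that can leave CPUs idle under a read-heavy load. With `std::shared_mutex`, only `set` calls need exclusive access. Whether this helps in practice depends on the read/write ratio and on how long each critical section is held.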