Resolving a year-long ClickHouse® lock contention
Blog post from Tinybird
Tinybird engineers resolved a longstanding problem in their ClickHouse® cluster that had limited query concurrency and left CPU resources underutilized. Despite high demand, the system's CPU usage stayed below 20%, and the team relied on a series of temporary fixes over the course of a year. The breakthrough came when they identified a spike in ContextLockWait events, which pointed to contention on the Context lock. They responded with a significant refactor of the ClickHouse® code that replaced a global mutex with read-write mutexes, and they added a new metric to monitor the impact of the Context lock. In testing, the change dramatically increased query throughput and brought CPU utilization up to 100%. The engineers do not expect a fivefold improvement in production, because other bottlenecks may surface, but even a 1.5x gain would significantly benefit Tinybird's infrastructure and its clients. The changes are included in the ClickHouse® 23.10 release, and the company anticipates better performance for its most demanding clients.
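
The general idea behind the refactor, swapping a single exclusive mutex for a read-write mutex so that concurrent readers no longer block each other, can be sketched in C++ roughly as follows. This is a minimal illustration under stated assumptions, not ClickHouse's actual code: `SharedSettings`, `get`, `set`, and `count_matching_reads` are hypothetical names invented for this example, and the real Context class is considerably more involved.

```cpp
#include <map>
#include <mutex>
#include <shared_mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for a shared, mostly-read object guarded by a lock.
// It is NOT ClickHouse's Context class; it only illustrates the locking pattern.
class SharedSettings {
public:
    // Readers take a shared lock, so many threads can read at the same time.
    std::string get(const std::string& key) const {
        std::shared_lock<std::shared_mutex> lock(mutex_);
        auto it = values_.find(key);
        return it == values_.end() ? std::string() : it->second;
    }

    // Writers take an exclusive lock, blocking readers only while mutating.
    void set(const std::string& key, const std::string& value) {
        std::unique_lock<std::shared_mutex> lock(mutex_);
        values_[key] = value;
    }

private:
    mutable std::shared_mutex mutex_;  // read-write mutex instead of a plain std::mutex
    std::map<std::string, std::string> values_;
};

// Starts `readers` threads that each read `key` once and returns how many
// of them observed `expected`. Each thread writes to its own slot in `hits`.
int count_matching_reads(const SharedSettings& settings, int readers,
                         const std::string& key, const std::string& expected) {
    std::vector<int> hits(readers, 0);
    std::vector<std::thread> threads;
    for (int i = 0; i < readers; ++i) {
        threads.emplace_back([&settings, &hits, &key, &expected, i] {
            hits[i] = (settings.get(key) == expected) ? 1 : 0;
        });
    }
    for (auto& t : threads) {
        t.join();
    }
    int total = 0;
    for (int h : hits) {
        total += h;
    }
    return total;
}
```

With a plain `std::mutex`, every call to `get` would serialize behind every other call, which is the kind of contention that can keep CPU usage low even under heavy load. With `std::shared_mutex` (C++17), only `set` needs exclusive access, so read-heavy workloads can proceed in parallel.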