The case of the vanishing CPU: A Linux kernel debugging story

Blog post from ClickHouse

Post Details
Company: ClickHouse
Date Published:
Author: Sergei Trifonov
Word Count: 7,345
Language: English
Hacker News Points: 47
Summary

A mysterious CPU spike in ClickHouse Cloud on GCP led to months of debugging and ultimately exposed a deeper issue in the Linux kernel's memory management. The investigation began with an occasional hiccup in ClickHouse Cloud infrastructure that engineers struggled to explain. At first it looked like random performance degradation, but further analysis revealed a hidden livelock caused by excessive contention on `mmap_lock`. The lock was held for an exceptionally long time during page fault handling while the kernel scanned 1,093,267 pages in an effort to reclaim memory for the cgroup. Nearly all of those pages were activated, and only 32 were successfully reclaimed. A second bug was discovered later, involving `lru_lock`, the spinlock that protects `struct lruvec`. The fix was to enable the Multi-Gen LRU (MGLRU) mechanism, which is designed to improve memory management and reduce contention on that spinlock. This change resolved the issue and improved overall system performance.
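
To check whether MGLRU is active on a given host, a minimal sketch along the following lines may help. It assumes a kernel built with MGLRU support that exposes the sysfs file `/sys/kernel/mm/lru_gen/enabled`, whose value is a hexadecimal bitmask with bit 0 as the main switch. The function names are hypothetical and are not taken from the post.

```python
from pathlib import Path

# sysfs file exposed by kernels built with Multi-Gen LRU support
# (assumption: the host kernel provides it; older kernels will not).
LRU_GEN_ENABLED = Path("/sys/kernel/mm/lru_gen/enabled")


def mglru_main_switch_on(raw: str) -> bool:
    """Return True if bit 0 (the main MGLRU switch) is set.

    `raw` is the file content, a hex bitmask such as "0x0007".
    """
    return bool(int(raw.strip(), 16) & 0x1)


def read_mglru_state(path: Path = LRU_GEN_ENABLED):
    """Return True/False for the main switch, or None if it cannot be read."""
    try:
        return mglru_main_switch_on(path.read_text())
    except (FileNotFoundError, PermissionError):
        return None


if __name__ == "__main__":
    state = read_mglru_state()
    if state is None:
        print("MGLRU state unavailable on this kernel")
    else:
        print("MGLRU enabled" if state else "MGLRU disabled")
```

Returning `None` when the file is missing keeps "not supported" distinct from "supported but switched off", which matters when auditing a mixed fleet of kernel versions.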