Big latencies? It’s open season for kernel bug-hunting!
Blog post from ScyllaDB
In this ScyllaDB blog post, Glauber Costa describes how the team diagnosed and resolved unexpected latency spikes in the database, which were ultimately traced to a bug in the Linux kernel's CFQ I/O elevator. When ScyllaDB users reported unpredictable write timeouts, Costa and the team investigated with tools such as systemtap and eBPF to track down the source of the spikes. They found that flush operations were being delayed because blocking syscalls were serialized, and that file-open operations were taking unusually long while waiting on locks inside the XFS filesystem. The root cause was a starvation problem in the CFQ scheduler, which could defer requests for as long as 15 seconds. The team worked around the issue by switching to the noop I/O elevator in ScyllaDB's 1.4 release and contributed a patch to the Linux kernel to fix the bug. The fix was subsequently merged into upstream Linux and backported to stable kernel versions, providing a more stable experience for ScyllaDB users.
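To see which I/O elevator a disk is using, Linux exposes a sysfs file at /sys/block/<device>/queue/scheduler that lists the available schedulers with the active one in square brackets, for example "noop deadline [cfq]". The following is a minimal Python sketch of reading and parsing that file. It is not taken from the blog post: the function names parse_scheduler and read_scheduler are hypothetical, and how ScyllaDB itself applies the setting is not described in this summary.

```python
from pathlib import Path


def parse_scheduler(text):
    """Parse the contents of a sysfs queue/scheduler file.

    Returns (active, available). The active scheduler is the entry wrapped
    in square brackets; it is None if no entry is bracketed.
    """
    active = None
    available = []
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]  # strip the brackets marking the active entry
            active = token
        available.append(token)
    return active, available


def read_scheduler(device):
    """Read the schedulers for a block device name such as 'sda' (illustrative)."""
    path = Path("/sys/block") / device / "queue" / "scheduler"
    return parse_scheduler(path.read_text())
```

Calling read_scheduler("sda") on a machine affected by the bug would be expected to report cfq as the active entry. Writing a scheduler name such as noop to the same file as root switches the elevator for that device. Newer multi-queue kernels use different scheduler names (for example none and mq-deadline), so the exact strings depend on the kernel version.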