Hunting a NUMA Performance Bug

Blog post from ScyllaDB

Post Details
Author: Michał Chojnowski
Word Count: 3,446
Language: English
Summary

In this blog post, Michał Chojnowski walks through troubleshooting a performance problem that appeared when running ScyllaDB on Oracle Cloud's ARM-based Ampere A1 servers: some runs of the database showed significantly lower throughput than others. The problem was initially suspected to be hardware-related, but it turned out to be a software-level CPU bottleneck. A methodical investigation with several performance monitoring tools traced the low throughput to cache line invalidations caused by a shared global tree node. The author notes how hard such subtle issues are to identify: the bottleneck was not apparent from traditional sampling methods, and finding it required a detailed understanding of how the software interacts with the NUMA architecture. The post illustrates the challenges developers may face when optimizing applications for ARM-based platforms and underscores the importance of considering memory access patterns and their performance impact.