Open-Source Low-Overhead NVIDIA CUDA PC Sampling
Blog post from Polar Signals
The CUDA Profiling Tools Interface (CUPTI) now includes Program Counter (PC) sampling, which lets developers analyze CUDA program performance at the instruction level and see why warps stall. The feature has traditionally been used in developer tools such as NVIDIA Nsight; a low-overhead continuous profiler now makes it usable in production as well. PC sampling relies on dedicated hardware that records the state of each GPU warp at a configurable interval, capturing PC offsets and stall reasons but no timestamps or call stacks. To keep data collection efficient, the implementation uses a dynamic algorithm that periodically enables and disables PC sampling, while a shim library interfaces with CUPTI to collect the data and transmit it to a backend. There, the PC/stall-reason pairs are processed and symbolized to show where and why the GPU stalls, which extends continuous profiling tools like Polar Signals. As a result, users can keep profiling running in production and still capture instruction-level GPU insights for performance optimization.
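Because the samples carry no timestamps or call stacks, the backend's job is essentially counting. The sketch below shows one way PC/stall-reason pairs could be aggregated and turned into a per-kernel stall breakdown. It is an illustration only, not the Polar Signals implementation: the `PcSample` record layout, the function names, and the stall-reason labels are all hypothetical and do not correspond to actual CUPTI identifiers.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple, Tuple


class PcSample(NamedTuple):
    """One record as a shim might hand it to a backend (hypothetical layout)."""
    function: str      # symbolized kernel name (assumed to be resolved already)
    pc_offset: int     # instruction offset within the kernel
    stall_reason: str  # illustrative label, not a real CUPTI stall-reason name
    count: int         # number of samples observed for this pair


Key = Tuple[str, int, str]


def aggregate(samples: Iterable[PcSample]) -> Counter:
    """Sum sample counts per (function, pc_offset, stall_reason) key."""
    totals: Counter = Counter()
    for s in samples:
        totals[(s.function, s.pc_offset, s.stall_reason)] += s.count
    return totals


def stall_breakdown(totals: Counter, function: str) -> Dict[str, float]:
    """Return each stall reason's share of the samples attributed to one kernel."""
    per_reason: Counter = Counter()
    for (fn, _pc, reason), n in totals.items():
        if fn == function:
            per_reason[reason] += n
    total = sum(per_reason.values())
    if total == 0:
        return {}
    return {reason: n / total for reason, n in per_reason.items()}


# Made-up batch: two records for the same PC and stall reason, one for another PC.
batch = [
    PcSample("saxpy", 0x40, "memory_wait", 6),
    PcSample("saxpy", 0x40, "memory_wait", 2),
    PcSample("saxpy", 0x58, "not_selected", 2),
]
totals = aggregate(batch)
breakdown = stall_breakdown(totals, "saxpy")

if __name__ == "__main__":
    for reason, share in sorted(breakdown.items()):
        print(f"{reason}: {share:.0%}")
```

Keying on the full (function, offset, stall reason) triple keeps the instruction-level detail, so the same totals can later be rolled up per kernel, as here, or per instruction.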