Continuous NVIDIA CUDA Profiling In Production
Blog post from Polar Signals
Polar Signals has released an open-source CUDA profiler, integrated into the parca-agent, that is designed for low-overhead, always-on profiling in production environments. Traditional tools such as NVIDIA Nsight provide detailed insights, but they are invasive enough to significantly slow the workload being profiled, which makes them a poor fit for continuous use.

To address this, Polar Signals combines the CUPTI profiling API, USDT probes, and eBPF into a single pipeline that captures GPU performance data with minimal intrusion. A shim library called parcagpu intercepts CUDA API calls and exposes timing and context information through USDT probes. eBPF programs capture that information and transfer it to userspace via perf event buffers, so the data never has to pass through the filesystem or the network.

The profiler supports both regular and graph kernel launches and allows detailed contextual labeling, which makes it suitable for continuous profiling of CUDA applications on both AMD64 and ARM64. To activate it, set the CUDA_INJECTION64_PATH environment variable for the CUDA application and run the parca-agent with CUDA instrumentation enabled.
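The activation step on the application side can be sketched as a small launcher that sets CUDA_INJECTION64_PATH before starting the workload. This is a minimal illustration, not Polar Signals' tooling: the helper names and the shim path `/usr/local/lib/libparcagpu.so` are hypothetical, and the sketch assumes the parca-agent is already running separately with CUDA instrumentation enabled (the post does not name the exact agent option, so none is shown). The demo command is a plain Python child process that echoes the variable, so the sketch runs without a GPU.

```python
import os
import subprocess
import sys
from typing import Mapping, Optional, Sequence

# Environment variable named in the post; the CUDA driver reads it to
# locate an injection library when the application initializes CUDA.
INJECTION_VAR = "CUDA_INJECTION64_PATH"


def build_injection_env(
    shim_path: str, base_env: Optional[Mapping[str, str]] = None
) -> dict:
    """Return a copy of the environment with the injection path set.

    The base environment is copied, never mutated, so the launcher's own
    process is unaffected.
    """
    env = dict(os.environ if base_env is None else base_env)
    env[INJECTION_VAR] = shim_path
    return env


def run_with_shim(
    cmd: Sequence[str], shim_path: str
) -> subprocess.CompletedProcess:
    """Run `cmd` as a child process with the injection variable set."""
    return subprocess.run(
        list(cmd),
        env=build_injection_env(shim_path),
        capture_output=True,
        text=True,
        check=True,
    )


if __name__ == "__main__":
    # Hypothetical install location for the parcagpu shim library.
    shim = "/usr/local/lib/libparcagpu.so"
    # Stand-in for a real CUDA application: just report what it was given.
    child = [
        sys.executable,
        "-c",
        "import os; print(os.environ['CUDA_INJECTION64_PATH'])",
    ]
    print(run_with_shim(child, shim).stdout.strip())
```

In a real deployment the child command would be the CUDA application itself, and the same variable could equally be set in a container spec or service unit rather than by a wrapper script.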