Profiling in PyTorch (Part 1): A Beginner's Guide to torch.profiler
Blog post from Hugging Face
Profiling in PyTorch can be daunting: the tooling is complex and the traces it produces are dense. Learning to navigate those traces is nonetheless crucial for optimizing machine learning models. This introductory guide to torch.profiler starts from a single fundamental operation, a matrix multiplication followed by a bias addition, and uses it to teach how to interpret profiler output and turn it into optimization decisions. It explains how to set up torch.profiler, how to read the profiler table and the trace, and how to follow the chain of events from a Python call down to CUDA kernel execution.

The guide also covers common profiling challenges, such as overhead-bound algorithms and CPU-GPU offsets, and looks at operator fusion at the dispatcher level, as seen when using torch.compile. It stresses that torch.compile can improve performance but also adds CPU overhead that is only amortized over larger workloads. By the end, readers should have a foundational understanding of how to use PyTorch's profiling tools to identify and address performance bottlenecks in their own code, which sets the stage for the more advanced techniques in later parts of the series.
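As a rough illustration of the kind of setup described here, the sketch below profiles a matrix multiplication followed by a bias addition and prints the profiler table. It is not the blog post's own code: the function name `matmul_add`, the tensor shapes, the warm-up count, and the `record_function` label are all assumptions chosen for the example. It falls back to CPU-only profiling when no GPU is available.

```python
import torch
from torch.profiler import ProfilerActivity, profile, record_function


def matmul_add(x, w, b):
    # Hypothetical toy workload: matrix multiplication followed by bias addition.
    return torch.matmul(x, w) + b


# Profile CUDA activity only when a GPU is actually present.
device = "cuda" if torch.cuda.is_available() else "cpu"
activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

# Shapes are arbitrary, picked only for illustration.
x = torch.randn(256, 512, device=device)
w = torch.randn(512, 128, device=device)
b = torch.randn(128, device=device)

# Warm up so one-time initialization costs do not dominate the profile.
for _ in range(3):
    matmul_add(x, w, b)
if device == "cuda":
    torch.cuda.synchronize()

with profile(activities=activities, record_shapes=True) as prof:
    # record_function adds a labeled range that shows up in the table and trace.
    with record_function("matmul_add"):
        out = matmul_add(x, w, b)
    if device == "cuda":
        # Wait for queued kernels so they finish inside the profiled region.
        torch.cuda.synchronize()

# Summary table of operators, aggregated by name.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))

# Optionally write a trace file for a trace viewer such as Perfetto:
# prof.export_chrome_trace("trace.json")
```

In the printed table, the labeled range should appear next to the ATen operators it triggered (for 2D inputs, typically `aten::matmul`, `aten::mm`, and `aten::add`). For tensors this small, CPU-side time tends to dominate, which is the overhead-bound situation the guide discusses.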