Matrix Multiplication on Blackwell: Part 2 - Using Hardware Features to Optimize Matmul
Blog post from Modular
In this blog post, the authors optimize matrix multiplication (matmul) on NVIDIA's Blackwell GPUs by leveraging hardware features such as the Tensor Memory Accelerator (TMA), tensor cores, and shared memory swizzling. The post covers loop tiling to reduce global memory traffic, shared memory to cache data tiles, and swizzling to eliminate shared memory bank conflicts, all of which raise computational throughput. Building on these techniques, the authors implement a more efficient matmul kernel that uses TMA for asynchronous data transfers, tensor cores for matrix-multiply-accumulate (MMA) operations, and Blackwell's dedicated Tensor Memory (TMEM) to reduce register pressure.

The optimized kernel is 58x faster than a naive kernel but still lags behind cuBLAS; it lays the groundwork for further gains through pipelining and overlapping operations. The authors conclude that subsequent posts will explore improvements in execution scheduling and algorithm design, aiming to close the performance gap with state-of-the-art implementations.
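The loop-tiling idea mentioned above can be modeled outside the GPU entirely. The sketch below (not the post's actual kernel code, which is GPU-specific) shows the blocked loop structure: each TILE x TILE block of A and B is revisited as a unit, which is what lets a real kernel stage those blocks in fast shared memory and reuse them instead of re-reading global memory. The tile size and function name here are illustrative choices.

```python
# Conceptual sketch of loop tiling for C = A @ B (pure Python, illustrative).
# On a GPU, each (i0, j0, k0) block of A and B would be loaded into shared
# memory once and reused across the inner loops, cutting global-memory reads.

TILE = 2  # illustrative tile size; real kernels pick it to fit shared memory

def tiled_matmul(A, B, n):
    """Multiply two n x n matrices (lists of lists) tile by tile."""
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, TILE):          # tile row of C
        for j0 in range(0, n, TILE):      # tile column of C
            for k0 in range(0, n, TILE):  # tile along the reduction dimension
                # Inner loops touch only one TILE x TILE block of A and B,
                # modeling the data reuse a shared-memory kernel exploits.
                for i in range(i0, min(i0 + TILE, n)):
                    for j in range(j0, min(j0 + TILE, n)):
                        acc = 0.0
                        for k in range(k0, min(k0 + TILE, n)):
                            acc += A[i][k] * B[k][j]
                        C[i][j] += acc
    return C
```

The numerical result is identical to a naive triple loop; only the traversal order changes, which is why tiling is purely a memory-locality optimization.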
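The bank-conflict problem that swizzling solves can also be sketched in a few lines. NVIDIA shared memory is divided into 32 banks; if all 32 threads of a warp read the same column of a 32-wide tile, every access maps to the same bank and the reads serialize. The XOR remap below is a minimal illustration of the general idea, assuming a 32-wide tile with one element per bank; the post's actual swizzle pattern (chosen to match TMA layouts) may differ.

```python
# Minimal sketch of XOR-based swizzling for a 32-wide shared-memory tile.
# Without swizzling, element (row, col) sits in bank col, so a column-wise
# access by a warp (col fixed, row varying) hits one bank 32 times.

BANKS = 32  # number of shared-memory banks on NVIDIA GPUs

def swizzle(row, col, width=BANKS):
    """Remap the column by XOR-ing in the row, spreading a column across banks."""
    return row, (col ^ row) % width

# After the remap, row r of column 0 lands in bank r: the warp's 32
# column-wise accesses now touch 32 distinct banks, i.e. conflict-free.
```

The remap is a permutation within each row, so no two elements collide and no extra storage is needed; that is what makes swizzling essentially free compared to the alternative of padding the tile.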