Author: Ali Taha
Word count: 3144
Language: English
Hacker News points: None

Summary

The text walks through advanced techniques for optimizing matrix multiplication performance on NVIDIA Blackwell GPUs, reaching up to 85% of state-of-the-art (SOTA) performance by combining the 2SM technique with pipelining. It explains how Streaming Multiprocessors (SMs) can be grouped into clusters whose members access one another's shared memory, enabling optimizations such as Tensor Memory Accelerator (TMA) multicasting: a single load is broadcast into the shared memory of several SMs at once, so collaborating SMs avoid fetching the same tile redundantly. The document also covers the 2SM Matrix Multiply-Accumulate (MMA), in which two paired SMs coordinate to execute one large MMA using inputs held in shared memory, reducing memory traffic.

Further optimizations pipeline the MMA and TMA operations using warp specialization and a circular buffer of shared-memory stages, so that tile loads for upcoming iterations overlap with the current iteration's computation. This is complemented by double-buffering the output write-out, which lets one result tile be staged while the previous one is still draining to global memory, yielding significant performance gains. The text concludes with the steps planned to close the remaining gap to full SOTA, hinting at features such as cluster launch control for persistent kernels.
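To make the clustering mechanism concrete, here is a minimal CUDA sketch (not from the article; kernel and variable names are illustrative) of a two-block thread block cluster in which each block reads its partner's shared memory through distributed shared memory. This is the same hardware pairing that TMA multicast and the 2SM MMA build on; it requires compute capability 9.0 or newer.

```cpp
#include <cstdio>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Two blocks per cluster: the hardware places them so that each block
// can address the other's shared memory (distributed shared memory).
__global__ void __cluster_dims__(2, 1, 1) peer_smem_kernel(int* out) {
    __shared__ int tile;
    cg::cluster_group cluster = cg::this_cluster();

    if (threadIdx.x == 0) tile = (int)cluster.block_rank();  // stage a "tile"
    cluster.sync();  // every block in the cluster has finished staging

    // Map our shared variable into the partner block's address space and
    // read the value the partner staged.
    unsigned peer = cluster.block_rank() ^ 1u;
    const int* peer_tile = cluster.map_shared_rank(&tile, peer);
    if (threadIdx.x == 0) out[cluster.block_rank()] = *peer_tile;
    cluster.sync();  // keep our shared memory alive until the peer is done
}

int main() {
    int* out = nullptr;
    cudaMallocManaged(&out, 2 * sizeof(int));
    peer_smem_kernel<<<2, 32>>>(out);  // grid size is a multiple of the cluster size
    cudaDeviceSynchronize();
    std::printf("block 0 read %d from its peer, block 1 read %d\n", out[0], out[1]);
    cudaFree(out);
    return 0;
}
```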
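The article's pipeline is built from Blackwell-specific TMA loads and MMAs, which are hard to show compactly. As a portable stand-in (illustrative names, sm_80 or newer), the sketch below implements the same circular-buffer idea with libcu++'s cuda::pipeline and cuda::memcpy_async: async copies for up to STAGES tiles are kept in flight while the oldest staged tile is consumed, so memory transfer overlaps computation.

```cpp
#include <cooperative_groups.h>
#include <cuda/pipeline>
namespace cg = cooperative_groups;

constexpr int TILE   = 256;  // elements per tile (illustrative)
constexpr int STAGES = 4;    // depth of the circular buffer

// Circular-buffered pipeline: while tile k is being consumed, tiles
// k+1 .. k+STAGES-1 are already in flight from global memory.
__global__ void pipelined_reduce(const float* in, float* out, int ntiles) {
    __shared__ float buf[STAGES][TILE];
    __shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, STAGES> state;
    auto block = cg::this_thread_block();
    auto pipe  = cuda::make_pipeline(block, &state);

    // Prologue: fill every stage of the circular buffer with an async copy.
    int prefetched = 0;
    for (; prefetched < STAGES && prefetched < ntiles; ++prefetched) {
        pipe.producer_acquire();
        cuda::memcpy_async(block, buf[prefetched % STAGES],
                           in + (size_t)prefetched * TILE,
                           sizeof(float) * TILE, pipe);
        pipe.producer_commit();
    }

    float acc = 0.f;
    for (int k = 0; k < ntiles; ++k) {
        pipe.consumer_wait();              // oldest stage has landed in smem
        for (int i = (int)threadIdx.x; i < TILE; i += blockDim.x)
            acc += buf[k % STAGES][i];     // stand-in for the MMA
        pipe.consumer_release();           // this stage's slot is free again

        if (prefetched < ntiles) {         // refill the freed slot
            pipe.producer_acquire();
            cuda::memcpy_async(block, buf[prefetched % STAGES],
                               in + (size_t)prefetched * TILE,
                               sizeof(float) * TILE, pipe);
            pipe.producer_commit();
            ++prefetched;
        }
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;  // per-thread partial sum
}
```

In the article's kernel the producer and consumer are separate warps (warp specialization); cuda::pipeline also supports that split via a cuda::pipeline_role argument to make_pipeline, which the unified version above omits for brevity.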
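Finally, a simplified sketch of the double-buffered write-out (again illustrative, not the article's code). With two staging buffers, the only barrier per fragment is "fragment fully staged": staging fragment f+1 can begin while fragment f is still draining, because the two target different buffers. On Blackwell the drain would be an asynchronous TMA store; plain global stores are used here to keep the example portable.

```cpp
#include <cuda_runtime.h>

constexpr int TILE = 128;  // output elements per fragment (illustrative)

// Double-buffered epilogue: one barrier per fragment instead of two.
__global__ void double_buffered_epilogue(float* out, int nfrags) {
    __shared__ float stage[2][TILE];
    for (int f = 0; f < nfrags; ++f) {
        float* cur = stage[f & 1];

        // Stage this fragment's results (placeholder for real epilogue math).
        for (int i = threadIdx.x; i < TILE; i += blockDim.x)
            cur[i] = 1000.0f * f + i;
        __syncthreads();  // all of fragment f is staged before anyone drains

        // Drain with a permuted mapping, since real epilogues reshuffle data
        // across threads for coalesced stores; threads read others' data.
        for (int i = threadIdx.x; i < TILE; i += blockDim.x) {
            int j = (i * 33) % TILE;  // gcd(33, TILE) == 1, so a permutation
            out[((size_t)blockIdx.x * nfrags + f) * TILE + j] = cur[j];
        }
        // No barrier here: the next fragment targets the other buffer, and
        // the next staging barrier guards reuse of this one two rounds later.
    }
}
```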