This post explores the development and optimization of a state-of-the-art matrix multiplication (matmul) kernel for NVIDIA's Blackwell GPU architecture, focusing on Cluster Launch Control (CLC). By using a persistent kernel to amortize launch overhead and CLC's hardware-managed scheduler to distribute work, the implementation reaches 1772 TFLOPS, surpassing the current state of the art. The post details how the persistent kernel improves scheduling by taking control of block tile coordinates, how pipelining CLC fetches overlaps scheduling with computation to hide its overhead, and how block swizzling improves L2 cache efficiency. The optimization journey, which also covers a circular buffer for Tensor Memory and parameter tuning for production shapes, culminates in the kernel reaching 100.6% of cuBLAS performance on specific matrix shapes. The series illustrates the programming techniques required to exploit advanced GPU features for peak performance, and future posts will offer further insights into high-performance coding in Mojo.
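To make the persistent-kernel pattern concrete, here is a minimal sketch in CUDA (the series itself is written in Mojo, whose GPU APIs are not reproduced here). A fixed grid of resident blocks repeatedly pulls block tile coordinates from a work queue; on Blackwell, CLC plays the role of that queue as a hardware-managed scheduler, so the global atomic counter below is a software stand-in, not the real CLC interface. `work_counter`, `compute_tile`, and the tile arithmetic are all illustrative assumptions.

```cuda
#include <cuda_runtime.h>

// Global work counter standing in for CLC's hardware scheduler
// (assumption: the real kernel queries CLC rather than a global atomic).
__device__ unsigned int work_counter = 0;

// Placeholder for the actual matmul tile computation.
__device__ void compute_tile(unsigned int tile_m, unsigned int tile_n) {}

__global__ void persistent_matmul(unsigned int tiles_m, unsigned int tiles_n) {
    __shared__ unsigned int tile_id;
    const unsigned int num_tiles = tiles_m * tiles_n;

    // Each block stays resident and keeps fetching tiles until none remain,
    // so launch overhead is paid once per block, not once per output tile.
    while (true) {
        if (threadIdx.x == 0) {
            tile_id = atomicAdd(&work_counter, 1u);
        }
        __syncthreads();  // make the fetched tile_id visible to all threads
        if (tile_id >= num_tiles) {
            break;
        }
        // Map the linear tile index to block tile coordinates.
        compute_tile(tile_id / tiles_n, tile_id % tiles_n);
        __syncthreads();  // everyone done reading tile_id before the next fetch
    }
}
```

The key property is that block launch cost is paid once per resident block rather than once per output tile, which is what the post means by reducing overhead.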
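The post also pipelines CLC fetches so that the next tile's coordinates are already in flight while the current tile computes. With real CLC the fetch is an asynchronous hardware operation completed through a barrier; in the software-counter sketch below, the overlap is limited to thread 0 issuing the next fetch while its peers work, but the loop structure is the same. Again a hedged sketch, reusing the hypothetical `work_counter` and `compute_tile` from above.

```cuda
__global__ void persistent_matmul_pipelined(unsigned int tiles_m,
                                            unsigned int tiles_n) {
    __shared__ unsigned int next_tile;
    const unsigned int num_tiles = tiles_m * tiles_n;

    // Prologue: fetch the first tile.
    if (threadIdx.x == 0) {
        next_tile = atomicAdd(&work_counter, 1u);
    }
    __syncthreads();  // fetch visible to all threads
    unsigned int cur = next_tile;
    __syncthreads();  // all threads have read before the next prefetch

    while (cur < num_tiles) {
        // Kick off the fetch for the following tile before computing this
        // one, so scheduling latency hides behind the main loop.
        if (threadIdx.x == 0) {
            next_tile = atomicAdd(&work_counter, 1u);
        }
        compute_tile(cur / tiles_n, cur % tiles_n);
        __syncthreads();  // prefetch visible; work on this tile complete
        cur = next_tile;
        __syncthreads();  // all threads have read before the next prefetch
    }
}
```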
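Block swizzling, mentioned above for L2 efficiency, remaps the linear tile index so that consecutively scheduled tiles cluster into small groups along one matrix dimension and therefore reuse the same operand rows and columns while they are still resident in L2. A hedged sketch, assuming a grouped ordering with a hypothetical `group_m` parameter; the post's actual swizzle pattern may differ.

```cuda
// Map a linear tile index to swizzled (tile_m, tile_n) coordinates.
// Consecutive indices walk down a group of group_m rows before moving to
// the next column, so nearby blocks share operand tiles in L2.
__host__ __device__ void swizzle_tile(unsigned int tile_id,
                                      unsigned int tiles_m,
                                      unsigned int tiles_n,
                                      unsigned int group_m,
                                      unsigned int* tile_m,
                                      unsigned int* tile_n) {
    const unsigned int tiles_per_group = group_m * tiles_n;
    const unsigned int group = tile_id / tiles_per_group;
    const unsigned int first_m = group * group_m;
    // The last group may be shorter than group_m.
    unsigned int size_m = tiles_m - first_m;
    if (size_m > group_m) {
        size_m = group_m;
    }
    const unsigned int local = tile_id % tiles_per_group;
    *tile_m = first_m + local % size_m;
    *tile_n = local / size_m;
}
```

Replacing the plain `tile_id / tiles_n` and `tile_id % tiles_n` mapping in the sketches above with `swizzle_tile` yields the swizzled schedule.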
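Finally, the circular buffer for Tensor Memory follows ordinary ring-buffer index discipline: the MMA pipeline fills one accumulator slot while the epilogue drains another, so the next tile's main loop can start before the previous tile's writeback finishes. The sketch below shows only the index management; the actual Tensor Memory allocation and the barriers that guard slot reuse on Blackwell are omitted, and `NUM_ACC_STAGES = 2` is an assumed tuning parameter, not the post's value.

```cuda
// Ring of accumulator slots (e.g. regions of Blackwell Tensor Memory).
// One slot is filled by the MMA pipeline while another is drained by the
// epilogue. Real code must wait on a barrier before reusing a slot; that
// synchronization is intentionally omitted from this sketch.
constexpr int NUM_ACC_STAGES = 2;  // assumed pipeline depth

struct AccumulatorRing {
    int next = 0;

    // Slot for the current tile's accumulation; advances the ring so the
    // following tile lands in a different slot.
    __device__ int acquire() {
        const int slot = next;
        next = (next + 1) % NUM_ACC_STAGES;
        return slot;
    }
};
```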