The world's fastest unified matrix multiplication
Blog post from Modular
Modular has introduced a novel approach to the problem of AI compute fragmentation, focusing on matrix multiplication ("matmul") and offering a unified, extensible alternative to existing kernel libraries. Traditional libraries are typically hardware-specific and monolithic, and they struggle with composability, portability, and efficiency as AI workloads spread across increasingly diverse parallel hardware. Modular's solution consolidates many bespoke implementations into a "Single Source of Truth": adaptable, high-performance kernels that are architecture-agnostic, handle dynamic shapes well, and support extensive operator fusion without requiring a compiler engineer. This yields significant performance gains across hardware platforms, surpassing state-of-the-art libraries such as oneDNN and AOCL on Intel, AMD, and ARM systems. By taking a first-principles approach and embracing fusion, Modular aims to simplify AI infrastructure, improve the user experience, and enable rapid adaptation to new hardware, fostering broader accessibility and innovation in AI technology.
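To make the idea of operator fusion concrete, here is a minimal sketch in plain Python/NumPy. It is not Modular's kernel code or API, just an illustration under simplified assumptions: it contrasts computing a matmul, bias add, and ReLU as three separate passes over memory with a single fused pass that applies the epilogue to each output tile while it is still in cache.

```python
import numpy as np

def unfused(a, b, bias):
    # Three separate kernels: each one writes a full intermediate to memory.
    c = a @ b                    # matmul
    c = c + bias                 # bias add
    return np.maximum(c, 0.0)    # ReLU

def fused(a, b, bias, tile=64):
    # One pass: the bias add and ReLU ("epilogue") are applied per output
    # tile right after that tile is computed, before it leaves cache.
    m, _ = a.shape
    _, n = b.shape
    out = np.empty((m, n), dtype=a.dtype)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            acc = a[i:i + tile, :] @ b[:, j:j + tile]           # tile matmul
            acc += bias[j:j + tile]                             # fused bias add
            np.maximum(acc, 0.0, out=out[i:i + tile, j:j + tile])  # fused ReLU
    return out

# Quick check that the fused and unfused versions agree.
a = np.random.rand(256, 128).astype(np.float32)
b = np.random.rand(128, 256).astype(np.float32)
bias = np.random.rand(256).astype(np.float32)
assert np.allclose(unfused(a, b, bias), fused(a, b, bias), atol=1e-5)
```

In a real fused kernel the epilogue runs inside the tiled matmul itself rather than as separate NumPy calls, avoiding the extra round trips to memory that the unfused version pays for each intermediate result.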