MiniMax Goes Sparse: Decoding M3's Attention from a Single Diagram
Blog post from HuggingFace
MiniMax's new architecture, M3, introduces a sparse attention mechanism with reported speedups of 9.7× for prefill and 15.6× for decode at 1 million tokens, according to a diagram shared by R&D lead Skyler Miao. It marks a shift away from the full attention approach of M2, which lacked the production readiness of M1's Lightning Attention.

M3's design separates two tasks: selecting key-value (KV) pairs and computing attention. Selection happens at the block level, and attention is then computed over the selected blocks without giving up the expressive power of softmax attention. By adopting grouped-query attention (GQA) over multi-head latent attention (MLA) and eliminating redundant branches, M3 balances engineering efficiency against quality, in line with the principles of Native Sparse Attention (NSA). The design reflects a strategic choice to prioritize practical implementation speed and reuse of existing kernels over theoretical optimization. That positions MiniMax at the forefront of long-context open-source models as the industry standardizes around 1 million token contexts.
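To make the split between KV selection and attention computation concrete, here is a minimal, dependency-free sketch for a single query vector. It is not MiniMax's implementation: the function names, the `block_size` and `top_k` parameters, the toy data, and in particular the rule of scoring each block by its mean key are all assumptions chosen for illustration. The only parts taken from the description above are the two-step structure and the use of ordinary softmax attention over the selected blocks.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    # Subtract the max for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_blocks(q, keys, block_size, top_k):
    """Step 1: choose which KV blocks the query may attend to.

    Assumption: a block is scored by the dot product between the query
    and the mean of the block's keys. The real scoring rule may differ.
    """
    scores = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        mean_key = [sum(col) / len(block) for col in zip(*block)]
        scores.append(dot(q, mean_key))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])  # block indices in positional order


def sparse_attention(q, keys, values, block_size, top_k):
    """Step 2: plain softmax attention, restricted to the selected blocks."""
    chosen = select_blocks(q, keys, block_size, top_k)
    idx = [
        i
        for b in chosen
        for i in range(b * block_size, min((b + 1) * block_size, len(keys)))
    ]
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([dot(q, keys[i]) * scale for i in idx])
    dim = len(values[0])
    out = [sum(w * values[i][d] for w, i in zip(weights, idx)) for d in range(dim)]
    return out, chosen


# Toy data: 8 tokens in 4 blocks of 2; values[i] simply encodes position i.
q = [1.0, 0.0]
keys = [
    [0.0, 1.0], [0.0, 1.0],    # block 0: orthogonal to q
    [5.0, 0.0], [4.0, 0.0],    # block 1: strongly aligned with q
    [-1.0, 0.0], [-1.0, 0.0],  # block 2: opposed to q
    [1.0, 0.0], [0.5, 0.0],    # block 3: weakly aligned with q
]
values = [[float(i)] for i in range(len(keys))]

out, chosen = sparse_attention(q, keys, values, block_size=2, top_k=2)
full, all_blocks = sparse_attention(q, keys, values, block_size=2, top_k=4)
```

With `top_k` equal to the number of blocks, the function reduces to ordinary dense softmax attention, which is one way to see that the selection step only narrows the set of keys and leaves the attention computation itself unchanged. That property is also what would let the second step reuse an existing attention kernel in a real system.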