Serving MiniMax-M3 for Efficient Inference: Unlocking 1M-Token Context and Multimodality Without Regrets
Blog post from Together AI
Together AI has partnered with MiniMax to serve the newly launched MiniMax M3, a state-of-the-art model with a 1M-token context window, native multimodality, and support for agentic workflows. M3 introduces MiniMax Sparse Attention (MSA) to overcome the attention-computation bottleneck, which significantly improves performance on long-context workloads. To serve M3 efficiently at scale, Together AI's Inference and Kernel teams developed a KV-Block-Major sparse attention kernel, a novel paged attention integration for MSA, and a Rust-based multimodal preprocessing gateway. These optimizations address engineering challenges such as supporting the 1M-token context length and optimizing multimodal processing, and they improved throughput by 81-125% across different concurrency levels. Together AI will also host the open-weights model, giving developers access once it is publicly released.
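The summary does not describe how the paged attention integration for MSA works internally. As background only, here is a minimal Python sketch of the general block-table idea behind paged attention: the KV cache is split into fixed-size physical blocks, and each sequence keeps a table that maps its logical token positions to those blocks. The class name `PagedKVCache`, its methods, and the block size of 16 are hypothetical choices made for illustration; they are not taken from the post or from Together AI's or MiniMax's code.

```python
from __future__ import annotations

from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per physical KV block (illustrative value, not from the post)


@dataclass
class PagedKVCache:
    """Toy block table: maps each sequence's logical blocks to physical block ids."""

    num_blocks: int
    free_blocks: list[int] = field(default_factory=list)
    block_tables: dict[str, list[int]] = field(default_factory=dict)
    seq_lens: dict[str, int] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # All physical blocks start out free.
        self.free_blocks = list(range(self.num_blocks))

    def append_token(self, seq_id: str) -> tuple[int, int]:
        """Reserve a KV slot for one new token; return (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:
            # The sequence is new or its last block is full: take a fresh block.
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[length // BLOCK_SIZE], length % BLOCK_SIZE

    def locate(self, seq_id: str, position: int) -> tuple[int, int]:
        """Translate a logical token position into (physical_block, offset)."""
        if not 0 <= position < self.seq_lens.get(seq_id, 0):
            raise IndexError(f"position {position} is not cached for {seq_id!r}")
        return self.block_tables[seq_id][position // BLOCK_SIZE], position % BLOCK_SIZE

    def release(self, seq_id: str) -> None:
        """Return all of a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

In a layout like this, a sequence's cache does not need to be contiguous in memory, so long contexts can grow one block at a time and finished requests return their blocks immediately. A sparse attention kernel could in principle use the same table to fetch only the blocks it selects, though how MSA does this is not stated in the post.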