Serving MiniMax-M3 for efficient inference: Unlocking 1M-Token Context and Multimodality Without Regrets

Post Details

Company

Together AI

Date Published

June 2, 2026

Author

Together AI

Word Count

1,652

Company Posts That Month

5

Language

English

Hacker News Points

-

Post removed?

No

Source URL

www.together.ai/blog/serving-minimax-m3-for-efficient-inference-unlocking-1m-token-context-and-multimodality-without-regrets

Summary

Together AI has partnered with MiniMax to serve the newly launched MiniMax-M3, a state-of-the-art model with a 1M-token context window, native multimodality, and agentic workflow support. M3 introduces MiniMax Sparse Attention (MSA) to overcome attention-computation bottlenecks, which significantly boosts performance on long-context workloads. To serve M3 efficiently at scale, Together AI's Inference and Kernel teams developed a KV-Block-Major sparse attention kernel, a novel paged attention integration for MSA, and a Rust-based multimodal preprocessing gateway. These optimizations improved throughput by 81-125% across different concurrency levels. The collaboration between Together AI and MiniMax also addresses engineering challenges such as supporting the 1M-token context length and optimizing multimodal processing. Together AI will host the open-weights model once it is publicly released, giving developers direct access.

