MiniMax M2.5: Intelligence too cheap to meter, RL process rewards, real-world productivity
Blog post from Baseten
MiniMax M2.5 addresses a core limitation of traditional reinforcement learning (RL): when a model receives only a single reward at the end of a long trajectory, the intermediate signals that distinguish useful steps from wasted ones are discarded. M2.5 instead applies a per-step process reward that preserves those intermediate signals and improves agent performance over long trajectories, achieving state-of-the-art benchmark results in tasks like coding and tool use at a significantly lower cost than closed-source models.

The approach refines Clipped Importance Sampling Policy Optimization (CISPO) and introduces token-specific rewards that optimize for both speed and quality. This tackles the credit assignment problem directly: each action's contribution to the final outcome is recognized rather than collapsed into a single terminal signal.

The model also generalizes well, outperforming competitors like Opus 4.6 on out-of-distribution tasks in the SWE-Bench Verified evaluation, while its affordability makes it a cost-effective choice for complex real-world tasks in finance, law, and the social sciences.
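The interaction between per-step process rewards and clipped importance-sampling weights can be sketched as follows. This is a minimal illustration under stated assumptions, not MiniMax's published implementation: the function names, the one-sided clip threshold, and the discounting scheme are all illustrative choices.

```python
def cispo_weights(ratios, eps_high=2.0):
    """CISPO-style weighting (sketch): clip the importance-sampling
    weight itself rather than zeroing the token's gradient, so every
    token still contributes a learning signal. The one-sided upper
    clip value is an assumed hyperparameter."""
    return [min(r, eps_high) for r in ratios]

def per_step_returns(step_rewards, gamma=1.0):
    """Per-step process rewards (sketch): each step is credited with
    its own reward plus the (discounted) rewards of the steps it
    enabled downstream, instead of one terminal outcome reward."""
    returns, g = [], 0.0
    for r in reversed(step_rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns

def policy_loss(ratios, step_rewards, baseline=0.0):
    """Combine clipped IS weights with per-step advantages into a
    policy-gradient loss (negated so lower is better)."""
    ws = cispo_weights(ratios)
    advs = [g - baseline for g in per_step_returns(step_rewards)]
    return -sum(w * a for w, a in zip(ws, advs)) / len(ratios)

# A trajectory where only the last step earns a reward still assigns
# credit to the earlier steps that led there:
print(per_step_returns([0.0, 0.0, 1.0]))  # → [1.0, 1.0, 1.0]
```

The contrast with an outcome-only reward is the point: under a single terminal reward every step receives identical, undifferentiated credit no matter what `step_rewards` contains, whereas the per-step scheme lets intermediate rewards shape each step's advantage.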