MiniMax M2.5: Intelligence too cheap to meter, RL process rewards, real-world productivity
Blog post from Baseten
MiniMax M2.5 addresses a core limitation of traditional reinforcement learning (RL): when a model receives only a single reward at the end of a long trajectory, the intermediate signals that distinguish useful steps from wasted ones are discarded. M2.5 instead applies a per-step process reward that preserves those intermediate signals and improves agent performance over long trajectories, achieving state-of-the-art benchmark results in tasks like coding and tool use at a significantly lower cost than closed-source models.

The approach refines Clipped Importance Sampling Policy Optimization (CISPO) and introduces token-specific rewards that optimize for both speed and quality. This tackles the credit assignment problem directly: each action's contribution to the final outcome is recognized rather than collapsed into a single terminal signal.

The model also generalizes well, outperforming competitors like Opus 4.6 on out-of-distribution tasks in the SWE-Bench Verified evaluation, while its affordability makes it a cost-effective choice for complex real-world tasks in finance, law, and the social sciences.
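The interaction between per-step process rewards and clipped importance-sampling weights can be sketched as follows. This is a minimal illustration under stated assumptions, not MiniMax's published implementation: the function names, the one-sided clip threshold, and the discounting scheme are all illustrative choices.

```python
def cispo_weights(ratios, eps_high=2.0):
    """CISPO-style weighting (sketch): clip the importance-sampling
    weight itself rather than zeroing the token's gradient, so every
    token still contributes a learning signal. The one-sided upper
    clip value is an assumed hyperparameter."""
    return [min(r, eps_high) for r in ratios]

def per_step_returns(step_rewards, gamma=1.0):
    """Per-step process rewards (sketch): each step is credited with
    its own reward plus the (discounted) rewards of the steps it
    enabled downstream, instead of one terminal outcome reward."""
    returns, g = [], 0.0
    for r in reversed(step_rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns

def policy_loss(ratios, step_rewards, baseline=0.0):
    """Combine clipped IS weights with per-step advantages into a
    policy-gradient loss (negated so lower is better)."""
    ws = cispo_weights(ratios)
    advs = [g - baseline for g in per_step_returns(step_rewards)]
    return -sum(w * a for w, a in zip(ws, advs)) / len(ratios)

# A trajectory where only the last step earns a reward still assigns
# credit to the earlier steps that led there:
print(per_step_returns([0.0, 0.0, 1.0]))  # → [1.0, 1.0, 1.0]
```

The contrast with an outcome-only reward is the point: under a single terminal reward every step receives identical, undifferentiated credit no matter what `step_rewards` contains, whereas the per-step scheme lets intermediate rewards shape each step's advantage.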