DeepSeek has introduced two openly licensed models, DeepSeek-R1-Zero and DeepSeek-R1, that challenge the conventional reliance on supervised data for training language models by leveraging reinforcement learning (RL) to elicit strong reasoning capabilities. Both models rival OpenAI's o1 on formal math and STEM benchmarks: DeepSeek-R1-Zero achieves large accuracy gains through RL alone and exhibits emergent behaviors such as self-verification, while DeepSeek-R1 adds a brief supervised fine-tuning phase to improve language consistency and usability, reaching comparable accuracy with more polished outputs. DeepSeek's five-stage training methodology, which comprises cold-start data collection, reasoning-oriented RL, rejection sampling, multi-domain supervised fine-tuning, and a final RL stage, shows that rule-based rewards combined with carefully structured training stages can balance raw performance with production readiness. This approach not only democratizes advanced reasoning capabilities by making them attainable with moderate computing resources, but also highlights the potential of RL over purely supervised methods for advancing language model reasoning.
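To make the rule-based reward idea concrete, the sketch below shows what such a reward might look like for math problems, in the spirit of the accuracy and format rewards described for DeepSeek-R1-Zero. The tag names, the string-matching verifier, and the equal weighting of the two signals are illustrative assumptions, not DeepSeek's actual implementation.

```python
import re

def format_reward(completion: str) -> float:
    """Reward 1.0 if the completion wraps its reasoning in <think>...</think>
    followed by <answer>...</answer>, else 0.0. (Tag names are illustrative.)"""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>\s*$"
    return 1.0 if re.match(pattern, completion, flags=re.DOTALL) else 0.0

def accuracy_reward(completion: str, ground_truth: str) -> float:
    """Reward 1.0 if the text inside <answer> matches the reference answer
    after normalization, else 0.0. A real math verifier would use exact match
    on a boxed answer or a symbolic checker; string equality is a stand-in."""
    match = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    if match is None:
        return 0.0
    predicted = match.group(1).strip().lower()
    return 1.0 if predicted == ground_truth.strip().lower() else 0.0

def total_reward(completion: str, ground_truth: str) -> float:
    """Combine the two rule-based signals; the equal weighting is illustrative."""
    return accuracy_reward(completion, ground_truth) + format_reward(completion)

# Example: a well-formed, correct completion scores 2.0.
sample = "<think>17 + 25 = 42</think>\n<answer>42</answer>"
print(total_reward(sample, "42"))  # 2.0
```

Because both signals come from simple, deterministic rules rather than a learned reward model, they are cheap to compute at scale and leave little room for the kind of reward hacking that neural reward models can invite.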