DeepSeek has introduced two openly licensed models, DeepSeek-R1-Zero and DeepSeek-R1, that challenge the conventional reliance on supervised data for training language models by leveraging reinforcement learning (RL) to elicit strong reasoning capabilities. Both models rival OpenAI's o1 on formal math and STEM benchmarks: DeepSeek-R1-Zero achieves large accuracy gains through RL alone and exhibits emergent behaviors such as self-verification, while DeepSeek-R1 adds a brief supervised fine-tuning phase to improve language consistency and usability, reaching comparable accuracy with more polished outputs. DeepSeek's five-stage training methodology, which comprises cold-start data collection, reasoning-oriented RL, rejection sampling, multi-domain supervised fine-tuning, and a final RL stage, shows that rule-based rewards combined with carefully structured training stages can balance raw performance with production readiness. This approach not only democratizes advanced reasoning capabilities by making them attainable with moderate computing resources, but also highlights the potential of RL over purely supervised methods for advancing language model reasoning.
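To make the rule-based reward idea concrete, the sketch below shows what such a reward might look like for math problems, in the spirit of the accuracy and format rewards described for DeepSeek-R1-Zero. The tag names, the string-matching verifier, and the equal weighting of the two signals are illustrative assumptions, not DeepSeek's actual implementation.

```python
import re

def format_reward(completion: str) -> float:
    """Reward 1.0 if the completion wraps its reasoning in <think>...</think>
    followed by <answer>...</answer>, else 0.0. (Tag names are illustrative.)"""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>\s*$"
    return 1.0 if re.match(pattern, completion, flags=re.DOTALL) else 0.0

def accuracy_reward(completion: str, ground_truth: str) -> float:
    """Reward 1.0 if the text inside <answer> matches the reference answer
    after normalization, else 0.0. A real math verifier would use exact match
    on a boxed answer or a symbolic checker; string equality is a stand-in."""
    match = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    if match is None:
        return 0.0
    predicted = match.group(1).strip().lower()
    return 1.0 if predicted == ground_truth.strip().lower() else 0.0

def total_reward(completion: str, ground_truth: str) -> float:
    """Combine the two rule-based signals; the equal weighting is illustrative."""
    return accuracy_reward(completion, ground_truth) + format_reward(completion)

# Example: a well-formed, correct completion scores 2.0.
sample = "<think>17 + 25 = 42</think>\n<answer>42</answer>"
print(total_reward(sample, "42"))  # 2.0
```

Because both signals come from simple, deterministic rules rather than a learned reward model, they are cheap to compute at scale and leave little room for the kind of reward hacking that neural reward models can invite.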