Company:
Date Published:
Author: Brad Hilton, Kyle Corbitt
Word count: 2321
Language: English
Hacker News points: 199

Summary

This post investigates using Group Relative Policy Optimization (GRPO) to train smaller, open-weight language models on complex deduction tasks. By applying reinforcement learning with carefully selected hyperparameters, the authors trained Qwen 14B and 32B models on challenging Temporal Clue puzzles to frontier-level reasoning accuracy at significantly reduced cost, improving the cost-accuracy trade-off in logical deduction tasks.
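The core idea behind GRPO is to replace a learned value network with a group-relative baseline: for each puzzle, several responses are sampled, scored, and each reward is normalized against its group's mean and standard deviation to produce an advantage. A minimal sketch of that normalization step (function name and reward values are hypothetical, not taken from the authors' code):

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Compute GRPO-style advantages: each reward's z-score within its group.

    `rewards` holds scores for a group of responses sampled for one prompt;
    `eps` guards against division by zero when all rewards are identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled solutions to one puzzle:
# the best solution gets a positive advantage, the worst a negative one,
# and the group's advantages sum to (approximately) zero.
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Because the baseline comes from the group itself, no separate critic model needs to be trained, which is part of what makes this approach cheaper than classic PPO-style RL.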