
Using GRPO to Beat o1, o3-mini and R1 at "Temporal Clue"

Blog post from OpenPipe

Post Details

Company: OpenPipe
Date Published:
Author: Brad Hilton, Kyle Corbitt
Word Count: 2,321
Language: English
Hacker News Points: 199
Summary

This post investigates using Group Relative Policy Optimization (GRPO) to train smaller, open-weight language models on a complex deduction task. By training Qwen 14B and 32B models on challenging Temporal Clue puzzles with reinforcement learning and carefully chosen hyperparameters, the authors pushed both models past o1, o3-mini, and R1 on this benchmark, demonstrating that open-weight models can reach frontier-level deduction accuracy at a fraction of the inference cost.
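The core idea behind GRPO is to sample a group of completions per prompt, score each one, and normalize each reward against the group's own statistics, which removes the need for a separate learned value baseline. The sketch below is a minimal illustration of that group-relative advantage computation under those assumptions, not OpenPipe's actual training code; the function name and the example rewards are illustrative.

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Compute GRPO-style advantages for one group of sampled completions.

    Each completion's reward is normalized against the group mean and
    standard deviation, so the group itself serves as the baseline.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        # Every completion scored the same: no learning signal from this group.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# Example: four sampled solutions to one puzzle, scored 0..1 by accuracy.
print(group_relative_advantages([1.0, 0.25, 0.25, 0.0]))
```

In a full training loop, these per-completion advantages weight the policy-gradient update for the tokens of the corresponding completion, so answers that beat their group average are reinforced and the rest are suppressed.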