Using GRPO to Beat o1, o3-mini and R1 at “Temporal Clue”

In this post, we look at how Group Relative Policy Optimization (GRPO) was used to improve the reasoning of smaller open-weight models on the puzzle game "Temporal Clue", beating o1, o3-mini, and R1 while coming close to Sonnet 3.7 at a fraction of the cost. The goal was to strengthen logical deduction through iterative reinforcement learning: for each puzzle the model generates multiple candidate responses, those responses are graded, the model is fine-tuned on the graded results, and the cycle repeats with new puzzles. The results show significant gains in deductive reasoning while remaining cost-effective, demonstrating the potential of reinforcement learning for training open-weight models. Everything needed to replicate and build on the work is made freely available.
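As a rough illustration of the loop described above, here is a minimal Python sketch of one GRPO-style training iteration. The helper names (`model.sample`, `model.update`, `grade`) and the reward definition are hypothetical placeholders rather than the post's actual implementation; the part that is characteristic of GRPO is grading a group of responses per puzzle and converting those grades into group-relative advantages.

```python
# Minimal sketch of a GRPO-style iteration, assuming a hypothetical `model`
# object with `sample` and `update` methods and puzzles with known solutions.
import statistics

def grade(puzzle, response):
    """Hypothetical reward: fraction of the puzzle's questions answered correctly."""
    correct = sum(response.get(q) == a for q, a in puzzle["solution"].items())
    return correct / len(puzzle["solution"])

def group_relative_advantages(rewards):
    """GRPO advantage: each reward centered and scaled by its own group's statistics."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

def training_iteration(model, puzzles, group_size=8):
    batch = []
    for puzzle in puzzles:
        # 1. Sample a group of candidate solutions for this puzzle.
        responses = [model.sample(puzzle["prompt"]) for _ in range(group_size)]
        # 2. Grade each response against the known solution.
        rewards = [grade(puzzle, r) for r in responses]
        # 3. Convert grades into group-relative advantages.
        advantages = group_relative_advantages(rewards)
        batch.extend(zip(responses, advantages))
    # 4. Fine-tune on the batch, weighting each response by its advantage
    #    (the clipped policy-gradient objective and KL penalty are omitted here).
    model.update(batch)
```

Because each response is scored relative to the other responses for the same puzzle, no separate value model is needed, which is part of what keeps a GRPO recipe comparatively cheap to run.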

https://openpipe.ai/blog/using-grpo-to-beat-o1-o3-mini-and-r1-on-temporal-clue
