This study explores how reinforcement learning (RL) can improve language models’ ability to solve complex problems, using the arithmetic game Countdown as a testbed. Qwen outperforms Llama, mainly because it naturally exhibits key cognitive behaviors such as verification and subgoal setting. Priming Llama with examples containing these behaviors yields significant improvements during RL training. Surprisingly, models primed with incorrect solutions that still follow proper reasoning patterns achieve results comparable to those primed with correct solutions. Continued pretraining on reasoning-focused data further boosts Llama’s performance, underscoring that a model’s initial reasoning behaviors shape its capacity for self-improvement.
https://arxiv.org/abs/2503.01307
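For context, Countdown asks the player (here, the model) to combine a given set of numbers with basic arithmetic so the result equals a target value, using each number at most once. A minimal brute-force solver sketches the task; the specific numbers and targets below are illustrative, not taken from the paper:

```python
from fractions import Fraction

def solve_countdown(numbers, target):
    """Search for an arithmetic expression over `numbers` (each used
    at most once) that evaluates to `target`. Returns the expression
    as a string, or None if no combination reaches the target."""
    # Each state pairs an exact value with the expression producing it.
    states = [(Fraction(n), str(n)) for n in numbers]

    def search(items):
        # Success if any remaining value already equals the target.
        for val, expr in items:
            if val == target:
                return expr
        # Try combining every ordered pair with +, -, *, /.
        for i in range(len(items)):
            for j in range(len(items)):
                if i == j:
                    continue
                (a, ea), (b, eb) = items[i], items[j]
                rest = [items[k] for k in range(len(items)) if k not in (i, j)]
                candidates = [(a + b, f"({ea}+{eb})"),
                              (a * b, f"({ea}*{eb})"),
                              (a - b, f"({ea}-{eb})")]
                if b != 0:  # guard against division by zero
                    candidates.append((a / b, f"({ea}/{eb})"))
                for v, e in candidates:
                    result = search(rest + [(v, e)])
                    if result is not None:
                        return result
        return None

    return search(states)

print(solve_countdown([1, 2, 3, 4], 24))   # e.g. an expression like ((1+2)*(4+3)) + ... that equals 24
print(solve_countdown([2, 3], 100))        # None: no combination of 2 and 3 reaches 100
```

Exact `Fraction` arithmetic avoids floating-point surprises with division; a verifier for RL reward computation would check a model-proposed expression the same way, by evaluating it and comparing against the target.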