Offline Reinforcement Learning for LLM Multi-Step Reasoning

In this paper, we introduce OREO (Offline Reasoning Optimization), a method that enhances the multi-step reasoning ability of large language models (LLMs) using offline reinforcement learning. Unlike Direct Preference Optimization (DPO), which struggles with multi-step reasoning tasks because it requires paired preference data and treats all tokens uniformly, OREO overcomes these limitations by jointly learning a policy model and a value function through the soft Bellman equation. Empirically, OREO outperforms existing offline learning methods on benchmarks spanning mathematical reasoning tasks and embodied agent control. OREO can also be extended to a multi-iteration framework and can guide tree search to improve test-time performance.
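The core idea, a value function and policy trained to jointly satisfy the soft Bellman consistency condition of KL-regularized RL, can be illustrated in a few lines. The snippet below is a minimal PyTorch sketch of such a consistency loss under assumed conventions (a token-level MDP, a frozen reference policy, and mostly-zero per-token rewards); it is not the paper's exact objective, and the function name, the detach placement, and the reward convention are assumptions for illustration.

```python
import torch


def soft_bellman_losses(values, logp_policy, logp_ref, rewards, beta=0.1):
    """Illustrative losses from the soft Bellman consistency condition

        beta * log(pi(a_t|s_t) / pi_ref(a_t|s_t)) = r_t + V(s_{t+1}) - V(s_t)

    for one trajectory of T tokens (names and detach choices are assumptions).

    values:      (T+1,) value estimates V(s_0), ..., V(s_T)
    logp_policy: (T,)   log pi(a_t | s_t) under the trained policy
    logp_ref:    (T,)   log pi_ref(a_t | s_t) under the frozen reference model
    rewards:     (T,)   per-token rewards (often zero except at the final step)
    """
    advantage = rewards + values[1:] - values[:-1]   # r_t + V(s_{t+1}) - V(s_t)
    kl_term = beta * (logp_policy - logp_ref)        # beta * log policy ratio

    # Fit the value head to the (detached) policy log-ratios ...
    value_loss = (kl_term.detach() - advantage).pow(2).mean()
    # ... and push the policy toward consistency with the (detached) values.
    policy_loss = (kl_term - advantage.detach()).pow(2).mean()
    return policy_loss, value_loss


# Toy usage with random tensors for a 5-token trajectory.
T = 5
values = torch.randn(T + 1, requires_grad=True)
logp_policy = torch.randn(T, requires_grad=True)
logp_ref = torch.randn(T)
rewards = torch.zeros(T)
rewards[-1] = 1.0  # sparse reward at the end of the reasoning chain

policy_loss, value_loss = soft_bellman_losses(values, logp_policy, logp_ref, rewards)
(policy_loss + value_loss).backward()
```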

https://arxiv.org/abs/2412.16145
