Training Language Models to Self-Correct via Reinforcement Learning

Researchers have developed SCoRe, an approach that improves the self-correction ability of large language models (LLMs) without relying on multiple models or external supervision. Self-correction has so far been largely ineffective in modern LLMs, and existing methods for training it have fallen short. SCoRe uses online reinforcement learning to train a single model on its own self-generated data, which addresses two problems with earlier training approaches: a distribution mismatch between the training data and the model's own responses, and a tendency to learn only a limited mode of correction behavior. Applied to the Gemini 1.0 Pro and 1.5 Flash models, SCoRe yielded significant improvements in self-correction performance on several benchmarks. The results suggest that self-correction can be trained into a single model using reinforcement learning on its own outputs.
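To make the idea of training on self-generated data more concrete, here is a minimal toy sketch of how two-attempt traces might be collected from a single policy and scored. It is not the SCoRe algorithm from the paper: `toy_policy`, `rollout`, `collect_on_policy`, the error rates, and the 0/1 reward are all hypothetical stand-ins chosen for illustration, and no actual policy update is performed.

```python
import random
from dataclasses import dataclass


@dataclass
class Trace:
    prompt: str
    truth: int
    first: int     # first attempt sampled from the policy
    second: int    # second (revised) attempt from the same policy
    reward: float  # 1.0 if the second attempt is correct, else 0.0


def toy_policy(a: int, b: int, rng: random.Random, error_rate: float) -> int:
    """Hypothetical stand-in for an LLM: adds two ints, sometimes off by one."""
    answer = a + b
    return answer + 1 if rng.random() < error_rate else answer


def rollout(a: int, b: int, rng: random.Random) -> Trace:
    """Sample two attempts from the same policy and score the second one."""
    truth = a + b
    first = toy_policy(a, b, rng, error_rate=0.5)   # assumed error rate
    second = toy_policy(a, b, rng, error_rate=0.2)  # assumed error rate
    reward = 1.0 if second == truth else 0.0
    return Trace(f"{a} + {b} = ?", truth, first, second, reward)


def collect_on_policy(n: int, seed: int = 0) -> list[Trace]:
    """Collect n traces generated by the policy itself (no external data)."""
    rng = random.Random(seed)
    return [rollout(rng.randint(0, 9), rng.randint(0, 9), rng) for _ in range(n)]


traces = collect_on_policy(100)
# Count traces where a wrong first attempt was followed by a correct second one.
fixed = sum(1 for t in traces if t.first != t.truth and t.second == t.truth)
print(f"wrong -> right corrections: {fixed} of {len(traces)} traces")
```

In a real online reinforcement learning setup, traces like these would be regenerated from the current model at each step and used to update it, so the mistakes being corrected are always the model's own.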

https://arxiv.org/abs/2409.12917
