In this study, the authors examine the role of online feedback in direct alignment from preferences (DAP) methods, such as DPO, which have gained popularity as alternatives to reinforcement learning from human feedback (RLHF). They argue that online feedback is key to improving these methods and propose online AI feedback (OAIF), which uses a language model as the annotator: at each training iteration, two responses are sampled from the current model and the annotator is prompted to choose the preferred one. Through human evaluation, the authors show that OAIF outperforms both offline DAP and RLHF methods on several tasks. They also note that the feedback in OAIF is easy to control through the instruction prompt given to the language model annotator.
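The online loop summarized above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes DPO as the preference loss, and the callables `sample`, `annotate`, `logp_policy`, and `logp_ref`, as well as the helper `build_annotator_prompt`, are hypothetical stand-ins for the model being aligned, the language model annotator, and a frozen reference model.

```python
import math
from typing import Callable, Tuple


def build_annotator_prompt(instruction: str, prompt: str, a: str, b: str) -> str:
    # Hypothetical template: the instruction is where feedback would be steered
    # (for example, "prefer the shorter response").
    return f"{instruction}\n\nPrompt: {prompt}\nResponse A: {a}\nResponse B: {b}\nWhich is better, A or B?"


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    # Standard DPO objective on sequence log-probabilities:
    # -log sigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l)))
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # Numerically stable softplus(-margin)
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def oaif_step(prompt: str,
              sample: Callable[[str], str],
              annotate: Callable[[str, str, str], bool],
              logp_policy: Callable[[str, str], float],
              logp_ref: Callable[[str, str], float],
              beta: float = 0.1) -> Tuple[str, str, float]:
    # 1. Sample two responses from the current policy (this is the online part).
    y1, y2 = sample(prompt), sample(prompt)
    # 2. Ask the annotator which one it prefers (True means y1 is preferred).
    chosen, rejected = (y1, y2) if annotate(prompt, y1, y2) else (y2, y1)
    # 3. Score the fresh pair with the preference loss.
    loss = dpo_loss(logp_policy(prompt, chosen), logp_policy(prompt, rejected),
                    logp_ref(prompt, chosen), logp_ref(prompt, rejected), beta)
    return chosen, rejected, loss
```

In a real training setup the loss would be backpropagated through the policy's log-probabilities; plain floats are used here only to keep the sketch self-contained.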
https://arxiv.org/abs/2402.04792