How RLHF Works

Reinforcement learning from human feedback (RLHF) can produce impressive results, but implementing it is challenging: its impact is hard to measure, and the optimization landscape is complex. RLHF works when there is a training signal that supervised learning alone cannot capture, and when a capable, instruction-following base model is available. The preference data at its core is time-consuming to collect, label, and process. RLHF is also limited by its prompt distribution: the prompts used to train the preference (reward) model should match those used during RL optimization. Scaling can significantly improve RLHF, but limited access to compute makes large-scale experimentation difficult. Nevertheless, RLHF remains an essential step toward building models that people love, and researchers continue to explore new optimization techniques and data practices.
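To make the role of preference data concrete, here is a minimal sketch of the standard Bradley-Terry pairwise loss commonly used to train reward models on labeled comparisons. The function name and reward values are illustrative, not from any particular library; a real implementation would score full (prompt, response) pairs with a learned model.

```python
import numpy as np

def preference_loss(chosen_rewards, rejected_rewards):
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).

    Low when the reward model assigns higher scores to the responses
    humans preferred; high when it ranks the rejected responses above them.
    """
    margin = np.asarray(chosen_rewards) - np.asarray(rejected_rewards)
    # log1p(exp(-m)) is a numerically stable form of -log(sigmoid(m))
    return float(np.mean(np.log1p(np.exp(-margin))))

# Illustrative scores: the model already prefers the chosen responses...
low = preference_loss([2.0, 1.5], [0.0, -0.5])
# ...versus a model that prefers the rejected ones.
high = preference_loss([0.0, -0.5], [2.0, 1.5])
```

Minimizing this loss over many labeled pairs is what turns raw human comparisons into a scalar reward signal that RL can then optimize against.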
