Training and Aligning LLMs with RLHF and RLHF Alternatives

Reinforcement learning from human feedback (RLHF) is a key part of training modern large language models (LLMs) such as ChatGPT and Llama 2. It incorporates human preferences into the model's optimization process, improving helpfulness and safety. The LLM training pipeline consists of three steps: pretraining, supervised finetuning, and alignment. Pretraining trains the model on vast amounts of unlabeled text; supervised finetuning refines it on instruction-output pairs; and the alignment step further tunes it toward human preferences via RLHF. Llama 2's RLHF approach differs from InstructGPT's in several ways, including the use of two reward models (one for helpfulness, one for safety) and rejection sampling. Although RLHF is worth the effort, alternative approaches are being researched, such as Constitutional AI and the Wisdom of Hindsight method.
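At the heart of the RLHF step is a reward model trained on human preference comparisons. As a rough illustration (not the exact implementation from either paper), the pairwise Bradley-Terry-style loss commonly used for reward-model training can be sketched in plain Python; the function name here is hypothetical:

```python
import math

def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise preference loss for an RLHF reward model:
    -log(sigmoid(r_chosen - r_rejected)).
    The loss is small when the human-preferred response already
    receives the higher reward, and large when the ranking is wrong."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Correct ranking (preferred response scored higher) -> small loss.
print(reward_model_loss(2.0, 0.0))
# Inverted ranking -> large loss, pushing the model to fix its scores.
print(reward_model_loss(0.0, 2.0))
```

Minimizing this loss over many human-labeled comparison pairs yields a scalar reward signal, which the subsequent reinforcement-learning stage (e.g. PPO in InstructGPT, or rejection sampling plus PPO in Llama 2) then optimizes the LLM against.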
