Weak-to-Strong Generalization

OpenAI's Superalignment team has released a paper proposing a new research direction for aligning future superhuman AI systems. The core challenge is that humans will need to supervise AI systems much smarter than themselves. To study this problem today, the team uses an analogy: small models supervising larger ones. Surprisingly, they find that supervision from a GPT-2-level model can elicit most of GPT-4’s capabilities, a phenomenon they call weak-to-strong generalization. This provides a starting point for tackling the central challenge of aligning superhuman models. The setup still has limitations, and it differs in important ways from humans supervising superhuman systems, but the team believes the direction offers promising opportunities for progress on AI alignment. They have also released open-source code and launched a grants program to support further research in this area.

https://openai.com/research/weak-to-strong-generalization