The repository presents an analysis of R1-Zero-like training, covering base models such as Qwen2.5 and reinforcement-learning methods such as Dr. GRPO. The authors share a minimalist recipe that trains Qwen2.5-Math-7B with Dr. GRPO to top performance in just 27 hours. They caution that mismatched chat templates can degrade a base model's reasoning ability, and they find that domain-specific pretraining improves what RL can achieve. Notably, they show that the GRPO objective introduces an optimization bias (for example, inflating the length of incorrect responses), which Dr. GRPO corrects. The repository also provides detailed instructions for running R1-Zero-style training on the Oat framework, and shows that even small question sets can improve reasoning when used correctly.
https://github.com/sail-sg/understand-r1-zero
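A minimal sketch of the bias the summary refers to, under my reading of the paper: GRPO normalizes group rewards by their standard deviation and divides the per-token loss by each response's length, while Dr. GRPO drops both normalizations and keeps only the mean baseline. Function names and the toy numbers are illustrative, not the repository's actual implementation.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, lengths: np.ndarray) -> np.ndarray:
    """GRPO-style advantage: mean/std-normalized reward, then the loss is
    further scaled by 1/length per response. The std term and the length
    term are the two sources of bias discussed in the paper."""
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    return adv / lengths  # effective per-token weight depends on length

def dr_grpo_advantages(rewards: np.ndarray, lengths: np.ndarray) -> np.ndarray:
    """Dr. GRPO-style advantage: subtract only the group mean (an unbiased
    baseline); no std or length normalization, so weights are length-free."""
    return rewards - rewards.mean()

if __name__ == "__main__":
    # Toy group of 4 sampled responses: two correct (reward 1), two not.
    rewards = np.array([1.0, 0.0, 0.0, 1.0])
    lengths = np.array([10, 50, 20, 40])
    print("GRPO:   ", grpo_advantages(rewards, lengths))
    print("Dr.GRPO:", dr_grpo_advantages(rewards, lengths))
```

In this sketch the GRPO weights change when response lengths change, whereas the Dr. GRPO weights depend only on the rewards.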