30% Drop in o1-preview Accuracy When Putnam Problems Are Slightly Varied

The Putnam-AXIOM benchmark challenges large language models (LLMs) on mathematical reasoning, exposing both gaps in reasoning performance and the impact of data contamination. While current benchmarks for evaluating LLMs are reaching saturation, Putnam-AXIOM stands out with 236 complex math problems from the Putnam Competition, plus a Variation benchmark of 52 problems that pose novel challenges. By programmatically altering problem elements, unique variants are generated. Surprisingly, models score much lower on the variations than on the original problems: OpenAI's o1-preview achieves only 41.95% accuracy on the Original benchmark and suffers a 30% accuracy drop on the variations.
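The paper's actual variation pipeline is not reproduced here, but the idea of programmatically altering problem elements can be sketched as follows. This is a minimal, hypothetical illustration: the template, the `vary_problem` helper, and the candidate value sets are assumptions for demonstration, not the Putnam-AXIOM generator.

```python
import random

def vary_problem(template: str, choices: dict[str, list[int]]) -> str:
    """Render one variation of a problem template by sampling new surface
    values (constants) that preserve the underlying reasoning task.
    Illustrative sketch only -- not the Putnam-AXIOM implementation."""
    values = {name: random.choice(options) for name, options in choices.items()}
    return template.format(**values)

# A hypothetical problem template with swappable constants.
template = "Find the remainder when {a}^{n} is divided by the prime {p}."
variation = vary_problem(
    template,
    {"a": [2, 3, 5], "n": [100, 101, 102], "p": [7, 11, 13]},
)
print(variation)
```

Because the surface values change while the solution method does not, a model that merely memorized the original problem (e.g. from contaminated training data) will fail on the variant, which is one plausible explanation for the observed accuracy drop.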

https://openreview.net/forum?id=YXnwlZe0yf&noteId=yrsGpHd0Sf
