FrontierMath: A benchmark for evaluating advanced mathematical reasoning in AI

The authors introduce FrontierMath, a benchmark of original, expert-level mathematics problems designed to test advanced reasoning in AI systems. While leading AI models excel on traditional benchmarks such as GSM-8K and MATH, even the strongest models, despite extensive support, solve fewer than 2% of FrontierMath problems, highlighting a significant gap between current AI capabilities and the expertise of the mathematics community; Fields Medalists who examined the problems found them extremely challenging. The problems are carefully designed to test genuine mathematical understanding, are “guessproof,” and are peer-reviewed by expert mathematicians. Next steps include regular evaluations, benchmark expansion, public release of problems, and enhanced quality assurance to further assess AI’s mathematical reasoning abilities.

https://epochai.org/frontiermath/the-benchmark