
OpenAI and Epoch AI’s Math Benchmark Snafu
Last week it was revealed that OpenAI funded FrontierMath, one of the latest math benchmarks, announced by the AI research firm Epoch AI this past November. Regrettably, Epoch AI did not disclose at the time that OpenAI had actually commissioned this dataset of 300 advanced math problems for AI evaluation.
In what appeared to be a coordinated announcement, the same day OpenAI announced o3, Epoch AI shared o3’s impressive performance on the FrontierMath benchmark:
Excited to see @OpenAI's o3 evaluated on FrontierMath—one of the few benchmarks featured in their demo. From <2% solved by top models previously to O3's 25% success rate marks a significant leap. Learn more about the benchmark here: https://t.co/Iyi2l65b6P
— Epoch AI (@EpochAIResearch) December 21, 2024
To throw some tinder onto the flickering flame of this disclosure gaffe, Epoch AI also failed to mention that OpenAI had access to the problems and solutions, with the exception of a 50-question holdout set, a fact it admitted only in a later clarification post.
Although this issue caused some understandable backlash from the AI community and the mathematicians who assisted in the project, Epoch AI isn’t the only one to step in some deep yogurt with regard to dubious reporting. There’s a lot at stake with these benchmarks, and AI model creators at times go to great lengths to look like they’ve taken a stronger lead than they actually have.
What’s unfortunate about this misstep, though, is that, given o1’s impressive performance on the even newer Humanity’s Last Exam (HLE) benchmark compared to most of its competitors, o3 quite possibly would have been the top performer anyway. In other words, OpenAI really doesn’t need to resort to subterfuge.
However, with the industry all abuzz over the performance of Chinese models coming from DeepSeek—which was hit by outages on its website yesterday after its AI assistant became the top-rated free application available on Apple’s App Store in the United States—I suspect OpenAI is feeling under the gun.
Issue with Self-Reporting in AI Leaderboards
When I first started pulling together my AI Strategy app, I naively assumed that organizations that manage AI benchmarks and their corresponding leaderboards took responsibility for running the evaluations. In reality, some benchmarks allow model owners to self-report their results; the MMLU Pro leaderboard, for example, flags self-reported scores in a dedicated ‘Data Source’ column.
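As a rough illustration of why that column matters, here’s a minimal sketch in Python, using entirely hypothetical rows and field names, of how an app might filter a leaderboard export so that only independently evaluated scores feed into a model-selection decision:

```python
# Hypothetical leaderboard rows; real leaderboards typically expose data via CSV/JSON exports.
leaderboard = [
    {"model": "model-a", "score": 0.78, "data_source": "self-reported"},
    {"model": "model-b", "score": 0.74, "data_source": "evaluated by maintainers"},
    {"model": "model-c", "score": 0.71, "data_source": "evaluated by maintainers"},
]

# Keep only rows whose results were produced by the benchmark maintainers,
# then sort by score for a more trustworthy comparison.
verified = sorted(
    (row for row in leaderboard if row["data_source"] != "self-reported"),
    key=lambda row: row["score"],
    reverse=True,
)

for rank, row in enumerate(verified, start=1):
    print(rank, row["model"], row["score"])
```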

I really appreciate this level of transparency and wish all leaderboards would disclose self-reporting as well as any ties they have with a particular model creator. At the same time, though, I struggle to understand why benchmark providers allow self-reporting at all. In the same way universities don’t let prospective students self-report standardized test scores or GPAs, why let model providers self-report benchmark results? Organizations and app creators are actually making decisions based on these test results. Ergo, transparency should be a higher priority imo.
Having seen how model creators are sometimes prone to cutting corners and playing a little fast and loose with their claims as they barrel toward the goal of being the first to stick their flag on the moon of artificial general intelligence (AGI), I now approach all leaderboards and model announcements with a healthy level of skepticism.

I just wish that those who report on new models wouldn’t blindly parrot model creators’ unproved performance claims.
Are Models Also Cheating?
The LiveCodeBench leaderboard highlights in red all models it suspects of ‘model contamination’ (i.e., training on the benchmark dataset after it’s made public). In any other context, this would be called cheating.

Although following how its creators arrive at this conclusion requires some mental gymnastics, I kinda love that they do this. Theirs is the only leaderboard I’ve come across that goes to this extent in pursuit of transparency.
Another slight they apply to these miscreant models: LiveCodeBench doesn’t award a rank to models suspected of contamination. They’re still included in the sort, but the leaderboard gives two distinct visual cues that their performance may not be entirely reliable, as indicated by the green highlights in the screenshot above.
That’s some data-driven tough love right there.
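For the curious, here’s a minimal sketch of the general idea, not LiveCodeBench’s actual implementation, using hypothetical model names, dates, and scores: flag a model as possibly contaminated when the benchmark problems were already public before its training cutoff, keep it in the score-sorted list, but withhold its rank.

```python
from datetime import date

# Hypothetical models with training cutoffs and benchmark scores.
models = [
    {"name": "model-a", "training_cutoff": date(2024, 9, 1), "score": 0.62},
    {"name": "model-b", "training_cutoff": date(2024, 3, 1), "score": 0.55},
    {"name": "model-c", "training_cutoff": date(2024, 11, 1), "score": 0.71},
]

# Date the benchmark problems were made public (hypothetical).
benchmark_release = date(2024, 6, 1)

# If a model's training data could include the public benchmark problems
# (i.e., its cutoff falls after the benchmark's release), flag it.
for m in models:
    m["suspected_contamination"] = m["training_cutoff"] > benchmark_release

# Sort everyone by score, but only hand out ranks to unflagged models.
rank = 0
for m in sorted(models, key=lambda m: m["score"], reverse=True):
    if m["suspected_contamination"]:
        print(" - ", m["name"], m["score"], "(contamination suspected, unranked)")
    else:
        rank += 1
        print(rank, m["name"], m["score"])
```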

So What?
None of these details is a smoking gun proving that model creators are intentionally being deceptive. Models that self-report, or that one benchmark owner suspects of contamination, should not on these grounds alone become anathema among AI practitioners and app developers. ☠️✋ A model’s performance may simply not live up to its hype. That said, claims that can’t be supported by statistically significant and transparently reported data should be approached with some skepticism and caution.
Research Further
You can stay abreast of AI news by following my AI Timeline (tutorial) and learn more about the performance benchmarks for math and complex reasoning tasks using this filtered view of my AI Strategy app (tutorial).
Let’s Discuss 💬
How important do you think it is for benchmark managers to report models they suspect have trained on their benchmark data or don’t submit their models to be evaluated by third parties? And do you think self-reporting performance metrics should be allowed?