
The Deep Research Arms Race
In December, Google announced Deep Research, its new agentic feature that allows Gemini Advanced subscribers to perform online research. Then on Sunday, OpenAI announced its own deep research agentic tool, initially restricting access to Pro users. Both make similar claims: their agents create multi-step research plans, follow trajectories to find the data they need, backtrack and react to real-time information where necessary, and deeply analyze relevant information from across the web on your behalf.
But then, two days after OpenAI’s big announcement, Hugging Face basically told both OpenAI and Google to hold their 🍺 by introducing an open-source alternative to both tools.
Price Comparison
Let’s break these deep research options down by price tag:
- OpenAI: $200/mo
- Google: $19.99/mo
- Hugging Face: free
I monitor AI innovations and announcements very closely (and so can you), and it’s not often you see a price disparity of this magnitude.
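To put that disparity in annual terms, here’s a quick back-of-the-envelope calculation using the list prices above (actual spend will obviously vary with usage and plan changes):

```python
# Back-of-the-envelope annualized cost of each deep research option,
# using the advertised monthly prices listed above.
monthly_prices = {
    "OpenAI (Pro)": 200.00,
    "Google (Gemini Advanced)": 19.99,
    "Hugging Face (open source)": 0.00,
}

for option, price in monthly_prices.items():
    print(f"{option}: ${price * 12:,.2f}/yr")

# OpenAI (Pro): $2,400.00/yr
# Google (Gemini Advanced): $239.88/yr
# Hugging Face (open source): $0.00/yr
```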
Quality Comparison
Google didn’t make any performance claims in their announcement, so we’ll focus on OpenAI and Hugging Face.
OpenAI’s Claims
OpenAI mentioned two different benchmarks on its landing page, so let’s break them down separately.
GAIA Benchmark
OpenAI claimed that its agent achieved an impressive average score of 72.57% on the GAIA benchmark. So I went to the GAIA leaderboard to verify this score. I first checked the current high score, which is 65.12% and held by h2oGPTe_Agent_v1.6.8 (built by the h2o.ai team on the Claude 3.5 Sonnet model). OpenAI’s deep research agent would have appeared at the top of the table if they had submitted it for evaluation, but it was strangely absent.
Wanting to be cautious in my speculations, I reached out to a member of the GAIA team, and they confirmed that OpenAI did not submit their agent for testing. They also confirmed that the OpenAI team would’ve had to use GAIA’s validation dataset, which is public (both the questions and the answers). I highlight a sample question below.
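If you want to pull a sample for yourself, here’s a minimal sketch using the datasets library. The repo id, config, and field names are my best recollection of the GAIA dataset card, so double-check them there; the dataset is also gated, so you’ll need to accept its terms on the Hub and authenticate first.

```python
# Minimal sketch: load GAIA's public validation split, which includes answers.
# Repo id, config, and column names are assumptions from the dataset card;
# the dataset is gated, so accept its terms and run `huggingface-cli login` first.
from datasets import load_dataset

gaia_val = load_dataset("gaia-benchmark/GAIA", "2023_all", split="validation")

sample = gaia_val[0]
print(sample["Question"])        # the prompt an agent would receive
print(sample["Final answer"])    # the ground-truth answer, sitting right next to it
```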
So we’re relying on the honor system here. But at what point does trust cross the threshold into gullibility? Imagine a teacher handing students an exam with the answers printed at the bottom of the page and trusting them not to look. Now make those students answerable to a board that’s freaking out about a falling stock price.
I’m not accusing the OpenAI team of training their agent on the GAIA validation dataset or misrepresenting their deep research tool’s performance. However, at minimum, the optics of announcing unverified benchmark results aren’t great. This is a problem I’ve hit on before in a separate post about OpenAI.
Humanity’s Last Exam Benchmark
OpenAI also claimed a score of 26.6% on the notoriously challenging Humanity’s Last Exam (HLE) benchmark—and even added that score to a replica of HLE’s leaderboard, which lent it more credibility imo…whether that was intentional or not.
So I went to the HLE leaderboard to verify the score, and OpenAI’s agent pulled a von Trapp family disappearing act…again.
Like the GAIA benchmark, HLE has released its validation dataset publicly, including answers. I highlight a sample question below.
And also like GAIA, the HLE team has kept its test set private to minimize the risk of model contamination (source). So it appears that the OpenAI team also ran their deep research tool through the publicly available validation set.
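As with GAIA, you can grab the public HLE questions (answers included) in a couple of lines. The repo id, split, and field names below are assumptions based on my recollection of the dataset card, so verify them there before relying on this.

```python
# Minimal sketch: the publicly released HLE questions ship with their answers.
# Repo id, split, and column names are assumptions; check the dataset card.
from datasets import load_dataset

hle = load_dataset("cais/hle", split="test")

row = hle[0]
print(row["question"])
print(row["answer"])   # the answer travels with the public question
```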
I’ve reached out to the HLE authors for verification and will report back if I get a response. But if past is prologue, it would mean the OpenAI team is once again relying heavily on the honor system—and without disclosing any of these reporting caveats.
What really doesn’t help is that even publications I genuinely respect report these claims as fact, creating an echo chamber of mad props. I’m really not trying to be pedantic about these claims. I just work on enough projects to see how project leads take them to heart and factor them into purchasing decisions.
Hugging Face’s Claims
Hugging Face claimed that their agent scored 55.15% on the GAIA benchmark. However, I knew their open-DeepResearch agent wouldn’t appear on the GAIA leaderboard, because they disclosed that they ran their test on the validation set. They also made it clear that they’re continuing to tweak their tool, so I imagine that once they’re comfortable with its performance, they’ll submit it to be officially evaluated against GAIA’s test dataset.
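If you want to tinker with the open-source route yourself, Hugging Face built its agent on top of its smolagents library. Below is a minimal sketch of an agentic research query with smolagents; the real open-DeepResearch agent layers a text-based browser and file-inspection tools (and a stronger model) on top of this, so treat the tool and model choices here as illustrative assumptions, not a reproduction of their setup.

```python
# Minimal sketch of an agentic research query using Hugging Face's smolagents.
# The tool and model choices are illustrative; Hugging Face's actual
# open-DeepResearch agent uses a richer browsing/inspection toolset.
from smolagents import CodeAgent, DuckDuckGoSearchTool, VisitWebpageTool, HfApiModel

agent = CodeAgent(
    tools=[DuckDuckGoSearchTool(), VisitWebpageTool()],  # basic web search + page reading
    model=HfApiModel(),  # defaults to a hosted open model via the HF Inference API
)

report = agent.run(
    "Compare the pricing and access restrictions of OpenAI's and Google's "
    "deep research offerings, and cite the sources you used."
)
print(report)
```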
Transparency Is Key
As I stated in my previous post on transparency, this practice of making unverified claims without qualification is reaching a fever pitch. To be fair, it isn’t just an OpenAI issue—far from it. It’s a much more general problem in the AI space. Right now the stakes are high, and model developers are desperate to find ways to pull away from the pack.
I also think the DeepSeek R1 frenzy, which sent the US stock market into a spiral last week, is playing into some of these flexes.
Hopefully, in time, there will be a greater understanding of the challenges of measuring model performance, and publishers will start holding model developers’ feet to the fire a bit more, nudging them to be a little more circumspect with these knee-jerk announcements.
Learn More
In the meantime, if you have any inclination to use one of these agentic research tools, you may at least want to monitor future announcements about deep research initiatives in my AI Timeline (filtered for the ‘deep research’ tag).

And if you want to learn more about the HLE and GAIA leaderboards, you can click on their nodes in this filtered view of my AI Strategy app.
Finally, you can check out these other open-source alternatives to the proprietary deep research tools.