Benchmarks are quizzes used to measure language models such as ChatGPT. Think of them as the SAT exam for AIs. Much like students on the SATs, AIs cheat on them. Also like the SATs, they are misrepresented by the people who use them.
GPQA, or “Graduate-Level Google-Proof Q&A Benchmark”, is a set of difficult exam questions designed to be, as the title suggests, “Google-proof”. The questions are meant to demand knowledge that existing search engines cannot simply retrieve.
How do the authors make their questions “Google-proof”? By making sure “highly skilled and incentivized non-experts” (human PhD candidates in other domains) cannot solve them.
The authors report: “Experts achieve 65% accuracy, and many of their errors arise not from disagreement over the correct answer to the question, but mistakes due to the question’s sheer difficulty (when accounting for this conservatively, expert agreement is 74%). In contrast, our non-experts achieve only 34% accuracy”.
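To put the reported figures in context, here is a minimal Python sketch that compares them with a random-guessing baseline. It assumes GPQA questions are four-option multiple choice, so guessing scores 25%; that baseline and the helper names `points_above_chance` and `chance_corrected` are illustrative assumptions, not something taken from the text above.

```python
# Accuracy figures reported by the GPQA authors, as quoted above.
REPORTED_ACCURACY = {
    "experts": 0.65,
    "experts (conservative agreement)": 0.74,
    "non-experts": 0.34,
}

# Assumption: four answer options per question, so random guessing scores 25%.
NUM_OPTIONS = 4
CHANCE = 1 / NUM_OPTIONS


def points_above_chance(accuracy: float, chance: float = CHANCE) -> float:
    """Gap between an accuracy and the guessing baseline, in percentage points."""
    return round((accuracy - chance) * 100, 1)


def chance_corrected(accuracy: float, chance: float = CHANCE) -> float:
    """Rescale accuracy so that guessing maps to 0 and a perfect score maps to 1."""
    return round((accuracy - chance) / (1 - chance), 3)


if __name__ == "__main__":
    for group, accuracy in REPORTED_ACCURACY.items():
        print(
            f"{group}: {accuracy:.0%} accuracy, "
            f"{points_above_chance(accuracy)} points above chance, "
            f"chance-corrected score {chance_corrected(accuracy)}"
        )
```

Under that assumption, the non-experts land only about 9 points above guessing, while the experts land about 40 points above it.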
That’s where the representation problems begin. GPQA is billed as a “reasoning” benchmark by many, including, most famously, Anthropic, which lists GPQA as its first benchmark. This is evidently false from the GPQA methodology itself. The whole point of this benchmark is that generalists cannot solve the problems. It is by definition a benchmark of not just domain expertise, but highly specific domain expertise.
Of course, when used and understood properly, GPQA is helpful. It also appears in OpenAI, Meta, and Google product launches, where it is treated with a more responsible (or simply no) description. However, using the benchmark to imply that language models have above-human “reasoning” is simply incorrect. Language models still have a long way to go, with many bottlenecks and diminishing returns in their way.
That was an excellent organic chemistry question, however. I suspect that most organic chemists with a PhD (I have one) would never have come up with 16 without having to explicitly write out all the racemates (and maybe not even then), or even understood why there were 4 dienes rather than just three since there are only 3 isomers of methyl cyclopentadiene.