OpenAI's New GDPval Benchmark Reveals AI's Rapid Progress Towards Human-Level Expertise

September 25, 2025

Tech

AI Research

On the gold subset, advanced models are approaching expert-level quality on many tasks and continue to improve, though identified error types include instruction-following issues and hallucinations.
OpenAI has introduced a new benchmark called GDPval to evaluate AI performance across various industries and jobs, aiming to gauge how close AI models are to surpassing human expertise in economically valuable tasks.
In initial tests, Claude Opus 4.1 outperformed GPT-5, with its output rated as good as or better than that of human experts on 49% of tasks, partly because it produces more aesthetically pleasing charts. That is a significant improvement over GPT-4o's 13.7% win/tie rate just 15 months earlier.
Experts suggest that AI is more likely to augment human roles by offloading routine tasks, allowing professionals to focus on strategic, creative, or interpersonal responsibilities rather than replacing jobs entirely.
The GDPval evaluation framework includes scenario analyses comparing human-only workflows with model-assisted processes, revealing potential reductions in time and costs, especially in repetitive or complex tasks.
These results highlight how quickly AI capabilities are advancing, with models like GPT-5 and Claude Opus 4.1 approaching expert-level performance across a range of tasks.
Despite impressive scores, OpenAI acknowledges that current benchmarks like GDPval only cover a limited scope of real-world jobs and primarily focus on report generation, with plans to develop more comprehensive assessments.
The rapid improvements in benchmarks like GDPval suggest that the gap between AI and human expertise is narrowing faster than expected, pushing AI development toward more sophisticated and versatile systems.
OpenAI emphasizes that GDPval complements existing assessment tools by incorporating multi-modal, occupationally relevant tasks, and plans to expand coverage in future versions to better reflect real-world job complexities.
The goal of GDPval is to determine whether AI can replace or augment human workflows in practical, profit-driven tasks such as legal drafting and healthcare reporting.
OpenAI's leadership interprets these advancements as evidence that AI can help professionals focus on higher-value work by offloading routine tasks, with models already approaching industry expert quality on certain tasks.
An automated judging system shows about 66% agreement with human experts, serving as a scalable proxy for rapid iteration, though it is not a substitute for human judgment.
Models can complete GDPval tasks roughly 100 times faster and cheaper than humans, although this does not fully account for the complexities of oversight and iterative work in real-world scenarios.
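
The two headline metrics above, the win/tie rate against human experts and the automated judge's agreement with human graders, are both simple proportions. The summary does not say exactly how OpenAI computes them, so the following is only a minimal sketch under assumed definitions: each task gets one label ("win", "tie", or "loss" for the model versus the expert), and agreement is the share of tasks where two graders gave the same label. The function names and the sample labels are hypothetical and are not GDPval data.

```python
from collections import Counter


def win_or_tie_rate(grades):
    """Share of tasks where the model was rated better than or as good as the expert."""
    counts = Counter(grades)
    return (counts["win"] + counts["tie"]) / len(grades)


def agreement_rate(auto_labels, human_labels):
    """Share of tasks where the automated judge and the human grader gave the same label."""
    if len(auto_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    matches = sum(a == h for a, h in zip(auto_labels, human_labels))
    return matches / len(human_labels)


# Made-up labels for eight tasks, for illustration only (not GDPval results).
human = ["win", "loss", "tie", "loss", "win", "loss", "loss", "tie"]
auto = ["win", "loss", "loss", "loss", "win", "win", "loss", "tie"]

print(win_or_tie_rate(human))       # 2 wins + 2 ties out of 8 tasks
print(agreement_rate(auto, human))  # 6 matching labels out of 8 tasks
```

Under these assumed definitions, a 49% win/tie rate would mean that graders preferred the model's output or rated it equal on roughly half of the tasks, and 66% agreement would mean the automated judge matched the human label on about two tasks in three.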

Summary based on 9 sources

Sources

TechCrunch • Sep 25, 2025

OpenAI says GPT-5 stacks up to humans in a wide range of jobs

ZDNET • Sep 25, 2025

OpenAI tested GPT-5, Claude, and Gemini on real-world tasks - the results were surprising

Slashdot • Sep 25, 2025

OpenAI Says GPT-5 Stacks Up To Humans in a Wide Range of Jobs - Slashdot

MarkTechPost • Sep 25, 2025

OpenAI Introduces GDPval: A New Evaluation Suite that Measures AI on Real-World Economically Valuable Tasks
