OpenAI's New GDPval Benchmark Reveals AI's Rapid Progress Towards Human-Level Expertise

September 25, 2025
  • On GDPval's gold subset, advanced models are approaching expert-level quality on many tasks; they continue to improve, and the error types identified include instruction-following failures and hallucinations.

  • OpenAI has introduced a new benchmark called GDPval to evaluate AI performance across various industries and jobs, aiming to gauge how close AI models are to surpassing human expertise in economically valuable tasks.

  • In initial tests, Claude Opus 4.1 outperformed GPT-5, with its output rated as good as or better than human experts' work in 49% of tasks, partly because it produces more aesthetically pleasing charts. That is a significant improvement on GPT-4o's 13.7% win/tie rate just 15 months earlier.

  • Experts suggest that AI is more likely to augment human roles by offloading routine tasks, allowing professionals to focus on strategic, creative, or interpersonal responsibilities rather than replacing jobs entirely.

  • The GDPval evaluation framework includes scenario analyses comparing human-only workflows with model-assisted processes, revealing potential reductions in time and costs, especially in repetitive or complex tasks.

  • These results highlight how quickly AI capabilities are advancing, with models like GPT-5 and Claude Opus 4.1 approaching expert-level performance across a range of tasks.

  • Despite impressive scores, OpenAI acknowledges that current benchmarks like GDPval only cover a limited scope of real-world jobs and primarily focus on report generation, with plans to develop more comprehensive assessments.

  • The rapid improvements in benchmarks like GDPval suggest that the gap between AI and human expertise is narrowing faster than expected, pushing AI development toward more sophisticated and versatile systems.

  • OpenAI emphasizes that GDPval complements existing assessment tools by incorporating multi-modal, occupationally relevant tasks, and plans to expand coverage in future versions to better reflect real-world job complexities.

  • The goal of GDPval is to determine whether AI can replace or augment human workflows in practical, profit-driven tasks such as legal drafting and healthcare reporting.

  • OpenAI's leadership interprets these advancements as evidence that AI can help professionals focus on higher-value work by offloading routine tasks, with models already approaching industry expert quality on certain tasks.

  • An automated judging system shows about 66% agreement with human experts, serving as a scalable proxy for rapid iteration, though it is not a substitute for human judgment.

  • Models can complete GDPval tasks roughly 100 times faster and cheaper than humans, although this does not fully account for the complexities of oversight and iterative work in real-world scenarios.
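
The two headline metrics above, the win/tie rate and the automated judge's agreement with human experts, are simple proportions. The sketch below shows one plausible way to compute them. It is not OpenAI's grading code: the verdict labels ("win", "tie", "loss"), the function names, and the sample data are all hypothetical, and GDPval's actual definition of agreement may differ.

```python
from typing import Sequence


def win_or_tie_rate(verdicts: Sequence[str]) -> float:
    """Share of tasks where the model output was rated as good as
    or better than the expert's. Labels are an assumption."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    favourable = sum(1 for v in verdicts if v in ("win", "tie"))
    return favourable / len(verdicts)


def agreement_rate(auto: Sequence[str], human: Sequence[str]) -> float:
    """Share of tasks where the automated judge and the human
    grader gave the same verdict (simple exact-match agreement)."""
    if not auto or len(auto) != len(human):
        raise ValueError("verdict lists must be non-empty and equal in length")
    matches = sum(1 for a, h in zip(auto, human) if a == h)
    return matches / len(auto)


# Made-up verdicts for six tasks, for illustration only.
human_verdicts = ["win", "loss", "tie", "loss", "win", "loss"]
auto_verdicts = ["win", "loss", "loss", "loss", "win", "win"]

print(f"win/tie rate: {win_or_tie_rate(human_verdicts):.1%}")
print(f"judge agreement: {agreement_rate(auto_verdicts, human_verdicts):.1%}")
```

With these invented verdicts, three of the six tasks count as wins or ties, and the judge matches the human grader on four of six. Exact-match agreement is the simplest choice; a real evaluation might also report agreement between human graders themselves as a reference point.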

Summary based on 9 sources

