"People often speculate about AI’s broader impact on society, but the clearest way to understand its potential is by looking at what models are already capable of doing."
In case you missed it, amidst all the talk about whether AI is plateauing, OpenAI released a new evaluation paper, GDPval, measuring AI progress against human professionals.
Previous AI evaluations, like challenging academic tests and competitive coding challenges, have been essential for pushing the boundaries of model reasoning, but they often fall short of capturing the kinds of tasks many people handle in their everyday work.
"GDPval focuses on tasks based on deliverables that are either an actual piece of work or product that exists today or are a similarly constructed piece of work product. "
I like the idea of evaluating AIs on real-world tasks rather than made-up tests, and it seems OpenAI graded model outputs blind against human experts in each task's respective field. Here's how they describe the grading process:
"To evaluate model performance on GDPval tasks, we rely on expert “graders”—a group of experienced professionals from the same occupations represented in the dataset. These graders blindly compare model-generated deliverables with those produced by task writers (not knowing which is AI versus human generated), and offer critiques and rankings."
And on the results: "we ran blind evaluations where industry experts compared deliverables from several leading models—GPT‑4o, o4-mini, OpenAI o3, GPT‑5, Claude Opus 4.1, Gemini 2.5 Pro, and Grok 4—against human-produced work. Across 220 tasks in the GDPval gold set, we recorded when model outputs were rated as better than (“wins”) or on par with (“ties”) the deliverables from industry experts, as shown in the bar chart below. Claude Opus 4.1 was the best performing model in the set, excelling in particular on aesthetics (e.g., document formatting, slide layout), and GPT‑5 excelled in particular on accuracy (e.g., finding domain-specific knowledge). We also see clear progress over time on these tasks. Performance has more than doubled from GPT‑4o (released spring 2024) to GPT‑5 (released summer 2025), following a clear linear trend."
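To make the scoring concrete, here's a minimal sketch of how a win-or-tie rate could be tallied from blind pairwise judgments. This is my own illustration, not OpenAI's code; the judgment records and field names are hypothetical.

```python
from collections import defaultdict

# Hypothetical blind pairwise judgments: for each GDPval task, an expert grader
# compares a model deliverable against the human expert's deliverable without
# knowing which is which, and records "win", "tie", or "loss" for the model.
judgments = [
    {"model": "GPT-5", "task_id": 1, "result": "win"},
    {"model": "GPT-5", "task_id": 2, "result": "loss"},
    {"model": "Claude Opus 4.1", "task_id": 1, "result": "tie"},
    {"model": "Claude Opus 4.1", "task_id": 2, "result": "win"},
]

def win_or_tie_rates(judgments):
    """Fraction of tasks where the model's deliverable was rated as good as
    or better than the human expert's deliverable."""
    counts = defaultdict(lambda: {"win_or_tie": 0, "total": 0})
    for j in judgments:
        counts[j["model"]]["total"] += 1
        if j["result"] in ("win", "tie"):
            counts[j["model"]]["win_or_tie"] += 1
    return {model: c["win_or_tie"] / c["total"] for model, c in counts.items()}

print(win_or_tie_rates(judgments))
# e.g. {'GPT-5': 0.5, 'Claude Opus 4.1': 1.0}
```

On a metric like this, a rate around 50% means the graders preferred (or couldn't distinguish) the model's deliverable about as often as the human's, which is the threshold Zvi's caveat below is about.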
I read these results as saying that Claude Opus is on its way to being as good as a human at real-world tasks -- although Zvi Mowshowitz points out: "Crossing 50% does not mean you are better than a human even at the included tasks, since the AI models will have a higher rate of correlated, stupid or catastrophic failure."
Here's an interesting X thread about the project:
https://x.com/tejalpatwardhan/status/1971249532588741058
Also, OpenAI is publishing all their evaluation papers at evals.openai.com.