
"People often speculate about AI’s broader impact on society, but the clearest way to understand its potential is by looking at what models are already capable of doing."

In case you missed it, amidst all the talk about whether AI is plateauing, OpenAI released a new evaluation paper measuring AI progress against human experts.
Previous AI evaluations, like challenging academic tests and competitive coding challenges, have been essential in pushing the boundaries of model reasoning, but they often fall short of reflecting the kinds of tasks many people handle in their everyday work.

"GDPval focuses on tasks based on deliverables that are either an actual piece of work or product that exists today or are a similarly constructed piece of work product. "

I like the idea of evaluating AIs on real-world tasks rather than made-up tests. And it seems they graded the models blind against human experts in the tasks' respective fields. From the paper, on how the grading worked:
"To evaluate model performance on GDPval tasks, we rely on expert “graders”—a group of experienced professionals from the same occupations represented in the dataset. These graders blindly compare model-generated deliverables with those produced by task writers (not knowing which is AI versus human generated), and offer critiques and rankings."
"we ran blind evaluations where industry experts compared deliverables from several leading models—GPT‑4o, o4-mini, OpenAI o3, GPT‑5, Claude Opus 4.1, Gemini 2.5 Pro, and Grok 4—against human-produced work. Across 220 tasks in the GDPval gold set, we recorded when model outputs were rated as better than (“wins”) or on par with (“ties”) the deliverables from industry experts, as shown in the bar chart below. Claude Opus 4.1 was the best performing model in the set, excelling in particular on aesthetics (e.g., document formatting, slide layout), and GPT‑5 excelled in particular on accuracy (e.g., finding domain-specific knowledge). We also see clear progress over time on these tasks. Performance has more than doubled from GPT‑4o (released spring 2024) to GPT‑5 (released summer 2025), following a clear linear trend."
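Concretely, the headline metric is just the share of blind comparisons each model won or tied. Here's a minimal sketch of that scoring, assuming verdicts are logged one per task (the data shapes and names are my own, not OpenAI's):

```python
# Hypothetical sketch of the win/tie scoring described above -- not OpenAI's
# actual code. Assumes each blind comparison is logged as a grader verdict:
# "model" (model deliverable preferred), "human", or "tie".
from collections import defaultdict

def win_or_tie_rate(verdicts):
    """verdicts: iterable of (model_name, verdict) pairs, one per graded task."""
    counts = defaultdict(lambda: {"model": 0, "human": 0, "tie": 0})
    for model, verdict in verdicts:
        counts[model][verdict] += 1
    rates = {}
    for model, c in counts.items():
        total = c["model"] + c["human"] + c["tie"]
        # The bar chart's metric: share of tasks where the model's deliverable
        # was rated better than ("win") or on par with ("tie") the expert's.
        rates[model] = (c["model"] + c["tie"]) / total
    return rates

# Toy data, loosely shaped like the 220-task gold set:
sample = [("Claude Opus 4.1", "model"), ("Claude Opus 4.1", "tie"),
          ("GPT-5", "human"), ("GPT-5", "model")]
print(win_or_tie_rate(sample))  # {'Claude Opus 4.1': 1.0, 'GPT-5': 0.5}
```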
I read these results as saying that Claude Opus is on its way to being as good as a human at real-world tasks -- although Zvi Mowshowitz points out: "Crossing 50% does not mean you are better than a human even at the included tasks, since the AI models will have a higher rate of correlated, stupid or catastrophic failure."
Here's an interesting X thread about the project: https://x.com/tejalpatwardhan/status/1971249532588741058
Also, OpenAI is publishing all their evaluation papers at evals.openai.com
@optimism:
"Crossing 50% does not mean you are better than a human even at the included tasks, since the AI models will have a higher rate of correlated, stupid or catastrophic failure."
This is why I feel that this is all a sales pitch. Also, I don't hire in the bottom 50%; I hire in the top 2%. Get 100 resumes, burn 98, invite 2, hire 1.
The other thing is that I'd be Gell-Mann-amnesia-style betraying my own conscience by believing this, as just this morning I got code that didn't work when I tried something. And it wasn't even that hard to do it right. So expert level? No. Only if you are a lil yolo bitch with a big mouth on twitter that calls themselves an expert. In that case, you shall lose your internet credits. Preferably yesterday.
reply
Well, a lot of this really depends on who the "expert" humans in the blind test were.
I read 50% on the chart to mean it's a coin toss whether graders thought the human or the AI did better work. Less than 50% means graders tended to rank the AI's work below the humans'; greater than 50% means they tended to rank it above.
So the important factor is whether the humans the AI was graded against were "top 2%" kind of people.
Also, the point about AI failure being more likely to be catastrophic is valid.
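To make the coin-toss reading and that caveat concrete, here's a toy simulation (all numbers invented, nothing from the paper): the same roughly 50% head-to-head rate can hide very different failure distributions.

```python
# Toy simulation of the two points above -- every number here is invented.
# It shows a model can be a coin toss in blind head-to-head grading while
# still having a much fatter catastrophic tail than the human baseline.
import random

random.seed(0)
N = 10_000

def human_quality():
    # Steady expert: quality clustered around 0.7, essentially never near zero.
    return random.gauss(0.7, 0.1)

def model_quality():
    # Assumed 5% chance of a catastrophic deliverable (e.g., confidently wrong).
    if random.random() < 0.05:
        return 0.0
    return random.gauss(0.71, 0.1)  # otherwise about as good as the human

wins = sum(model_quality() > human_quality() for _ in range(N))
print(f"model preferred in {wins / N:.0%} of blind comparisons")
# Lands near 50% -- a coin toss to the graders -- yet 5% of the model's
# outputs were total failures, a mode the human here essentially never hits.
```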
Finally, I'd say I have no doubt that OpenAI is pumping their own bags with a sales pitch in every piece of info they put out. But even so, there is something here.
It feels to me like when social media was bursting onto the scene. I mostly dismissed it because I didn't see the utility and I didn't trust the promoters. Yet lately I've come to see that there may be some utility here. It may be an open question whether it's a net benefit, but it certainly is a powerful tool to do something. I see AI in the same light (and perhaps I'm just scared of repeating what I now see as a mistake in my attitude toward social media).
reply
This is the key sentence that kinda makes this evaluation not super useful:
"Additionally, in the real world, tasks aren’t always clearly defined with a prompt and reference files; for example, a lawyer might have to navigate ambiguity and talk to their client before deciding that creating a legal brief is the right approach to help them. We plan to expand GDPval to include more occupations, industries, and task types, with increased interactivity, and more tasks involving navigating ambiguity, with the long-term goal of better measuring progress on diverse knowledge work."
Part of the human expert's work is to define the problem and collect the relevant information needed. The AI didn't have to do any of that.
Moreover, the article didn't talk about whether the AI's work product was actually put into a productionized environment. For example, were the real estate listings actually posted automatically onto Redfin/Zillow? Another part of the human's work is to navigate many different tools and platforms, conform inputs and outputs to the expected formats, and interoperate between many technologies. Not sure if the AI can do that autonomously yet.
reply