This is the key sentence that makes this evaluation a lot less useful than it looks:
Additionally, in the real world, tasks aren’t always clearly defined with a prompt and reference files; for example, a lawyer might have to navigate ambiguity and talk to their client before deciding that creating a legal brief is the right approach to help them. We plan to expand GDPval to include more occupations, industries, and task types, with increased interactivity, and more tasks involving navigating ambiguity, with the long-term goal of better measuring progress on diverse knowledge work.
Part of the human expert's work is to define the problem and collect the relevant information. The AI didn't have to do any of that.
Moreover, the article didn't say whether the AI's work product was actually used in a production environment. For example, were the real estate listings actually posted automatically to Redfin/Zillow? Another part of the human's work is to navigate many different tools and platforms, conform inputs and outputs to the expected formats, and interoperate between many technologies. I'm not sure the AI can do that autonomously yet.