
AI in ASIA

What is GDPval - and why it matters

While GPT-5 shows notable progress, matching or outperforming experts in over 40% of tasks, the benchmark has limitations, capturing only narrow, one-shot deliverables rather than full job complexity. We unpack the implications for business and human-AI collaboration.

Intelligence Desk · 6 min read

AI Snapshot

The TL;DR: what matters, fast.

GDPval is a new OpenAI benchmark evaluating AI models against human professionals in 44 occupations across 9 industries.

GPT-5 achieves a "win or tie" rate of 40.6% on GDPval, indicating its ability to assist meaningfully in professional tasks.

Anthropic's Claude Opus 4.1 outperforms GPT-5 on GDPval with a win or tie rate of approximately 49%.

Who should pay attention: AI developers | Economists | Business leaders

What changes next: Expect iterative versions of GDPval (interactive, longer tasks) and selective enterprise adoption of AI for well-defined, lower-risk professional tasks.

A new benchmark suggests we’re inching toward AI doing parts of real jobs — but the full picture is still far from clear.

Could GPT‑5 already be doing the work of a software engineer, lawyer or nurse, or at least part of it? That is the provocative claim behind GDPval, a new evaluation by OpenAI that pits its models against human professionals across 44 occupations. The early results are striking, but they require nuance. This is not about AI replacing humans just yet; it's about measuring whether AI can already assist at a professional level.

GDPval is a new benchmark testing AI on real‑world deliverables (reports, blueprints, briefs) in 44 occupations across nine key industries. On this benchmark, GPT‑5 (high configuration) is rated "as good as or better than experts" about 40.6% of the time. Anthropic's Claude Opus 4.1 outperforms GPT‑5 here, winning or tying ~49% of the time. But GDPval's current form is one‑shot and narrow in scope: it doesn't capture iterative workflows, ambiguity, stakeholder interaction or long projects. The key takeaway: GPT‑5 is moving into territory where it can assist meaningfully in professional tasks, but full substitution of human roles remains distant.

What Is GDPval — and Why It Matters

OpenAI describes GDPval as an evaluation of economically valuable, real‑world tasks drawn from roles in industries that contribute heavily to GDP. It differs from classic benchmarks (math puzzles, multiple choice, synthetic tests) by asking models to generate deliverables — documents, diagrams, slides, plans — based on realistic context and reference files.

The benchmark covers 44 occupations (software development, engineering, nursing, legal, financial analysis, among others) across nine sectors. For each task, human graders (experts in the same domain) compare AI output with a human expert's version blind, and rate whether the AI's output is better, as good, or worse.
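The headline "win or tie" metric is simple aggregation over these blinded grader verdicts. A minimal sketch in Python, using made-up ratings rather than actual GDPval data:

```python
from collections import Counter

def win_or_tie_rate(ratings):
    """Share of tasks where the model's deliverable was rated 'better' than
    or 'as_good' as the human expert's version — a 'win or tie'."""
    counts = Counter(ratings)
    return (counts["better"] + counts["as_good"]) / len(ratings)

# Hypothetical verdicts from ten blinded comparisons (illustrative only)
ratings = ["better", "worse", "as_good", "worse", "worse",
           "better", "as_good", "worse", "worse", "better"]
print(f"win-or-tie rate: {win_or_tie_rate(ratings):.1%}")  # 50.0%
```

On this made-up sample the model wins or ties half the time; GPT‑5's reported ~40.6% and Claude Opus 4.1's ~49% are the same statistic computed over GDPval's real task set.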

Its ambition: to shift evaluation of AI from isolated puzzles to work‑relevant performance. If a model can already pass parts of what professionals do, it changes how businesses adopt and trust AI.

What the Results Show (and Don’t)

Encouraging gains, but not dominance

GPT‑5’s “win or tie” rate — ~40.6% in its “high” mode — is a big leap over previous models. By comparison, GPT‑4o scored ~13.7%.

But even 40 % is not “AI wins most of the time.” In many tasks, it still trails human experts. The benchmark is more about getting close in selected domains than sweeping dominance.

⚖️ Claude outpaces GPT‑5 in this test

Claude Opus 4.1 achieves a ~49% win/tie rate on the same benchmark — a notable margin over GPT‑5. OpenAI suggests Claude may benefit from stylistic and formatting appeal (nicer graphics, layout) in the judging, not purely content superiority.

This underlines that presentation matters in these comparisons, not just factual correctness or reasoning depth.

Limitations are significant

One‑shot format: each task is judged in a single pass, without room for revision, feedback loops or back-and-forth with stakeholders.

Narrow scope of “job work”: many professional roles involve ambiguity, negotiation, collaboration, evolving constraints, meetings and client interaction — none of which are captured in GDPval‑v0.

Judging bias: visual polish, formatting and readability might influence human graders. That can advantage models which produce clean layouts, not necessarily deeper insight.

Not generalisable to every role: tasks are drawn from 44 occupations — many roles or tasks outside those are untested.

So while impressive, the results are suggestive, not conclusive.

Why This Matters (for Business, AI Adoption, Strategy)

1. Productivity uplift, not wholesale displacement

Even if GPT‑5 can’t replace a professional entirely, being able to reliably draft or assist certain deliverables is immensely valuable. Professionals can offload parts of the workflow and focus on judgment, oversight, strategy, ethics. For more on this, check out our piece on what every worker needs to answer: What is your non-machine premium?.

OpenAI itself frames the narrative this way: as models improve, workers can “offload some of their work to the model and do higher‑value work.”

2. Deployment will be selective and domain-specific

Firms will first adopt AI in tasks that are well-defined, structured and lower risk. E.g. generating reports, summarising data, drafting first passes of legal memos. As models prove reliable, they’ll move into more complex areas.

3. Quality control and human oversight remain crucial

Even when AI output looks plausible, errors, hallucinations or context misunderstanding can creep in. Especially in domains like law, medicine, engineering, an errant detail can be costly. Any deployment must include checks, correction workflows, human-in-the-loop review. This highlights the importance of when AI slop needs a human polish.

4. Competitive landscape and hybrid strengths

Claude’s edge in GDPval suggests that different models may dominate different niches (style, presentation, domain depth). Organisations will want to choose or combine models based on their domain demands, not assume one “super‑model” wins everywhere. Our article Perplexity vs ChatGPT vs Gemini - five challenges, three contenders explores such comparisons.

5. Benchmark evolution will redefine “capable AI”

GDPval itself is just version 0. OpenAI plans future iterations with interactive workflows, longer tasks, ambiguity, interactivity. Over time, if a model can handle full project cycles, negotiation, error recovery and human collaboration, that’s when we cross a more meaningful threshold.

Evidence from Domain Benchmarks: Medical & Multimodal Gains

Beyond GDPval, GPT‑5 is also making marked gains in technical, domain-intensive evaluations:

In medical reasoning and imaging, GPT‑5 outperforms GPT‑4o in radiology, treatment planning and visual question answering tasks.

In a board‑style physics exam subset, GPT‑5 achieved ~90.7% accuracy vs ~78% for GPT‑4o, surpassing human pass thresholds.

In ophthalmology QA, GPT‑5-high achieved ~96.5% accuracy on a specialist dataset, outperforming GPT‑4o and other variants.

In domain integration tasks combining images and text, GPT‑5 shows gains over GPT‑4o in multi‑modal reasoning benchmarks.

These results confirm that GPT‑5’s improvements are not just superficial — they reflect deeper gains in reasoning, domain grounding and handling complex, integrated inputs.

What to Watch Next

GDPval v1, v2 & interactive benchmarks: Will OpenAI (or others) introduce versions allowing models to revise, ask questions, iterate and collaborate? That will push closer to measuring real job performance.

Real‑world case studies: Which firms begin embedding GPT‑5 in domain workflows? What savings, error rates and adoption challenges emerge?

Error analysis & failure modes: Where do models still misstep — in ambiguity, domain nuance, edge cases, unexpected constraints — and how often?

Regulation, liability and trust frameworks: As models shoulder parts of professional work, who is responsible for mistakes? Accountability, audit trails and transparency will become more urgent.

Model specialisation & hybrid stacks: We’ll likely see ensembles or hybrid systems: GPT‑5 plus domain-specific fine‑tuned models, or combining its generality with specialist tools (e.g. medical, legal). The “best” model may be a stack, not a standalone.

The GDPval benchmark is a milestone. GPT‑5’s performance — winning or tying in ~40% of professional tasks — signals we’re no longer in the realm of futuristic speculation: AI is already doing work that looks like what many professionals do. But we are not yet in the era where AI is the professional.

The transition now is from “AI as toy or research curiosity” to “AI as capable assistant.” The challenge over the next few years is whether AI can leap from assisting fragments of work to reliably navigating full professional workflows.
