Report Highlights

  • ORCA Benchmark reveals you have a 40% chance of getting a wrong answer when you ask AI for everyday math.
  • Why AI chatbots give detailed, confident explanations for the wrong mathematical answers.
  • Why the biggest AI models are failing at basic, everyday math.
  • How a simple rounding error reveals the core limitation of large language models.
  • Why AI is an unreliable calculator for your finances and your health.

1. You're 40% Likely to Get a Wrong Answer When You Ask an AI for Everyday Math

That's the truth we uncovered after testing today's five leading AIs on 500 real-world problems.

From calculating a tip to projecting a business ROI, we're trusting AI with our most basic calculations. Our data reveals that trust could be dangerously misplaced.

The ORCA (Omni Research on Calculation in AI) Benchmark, a comprehensive test spanning finance, health, and physics, reveals that no AI model scored above 63%. The leader, Gemini, still gets nearly 4 out of 10 problems wrong. The most common culprits? Not complex logic, but simple rounding errors and calculation mistakes.
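The report doesn't publish its failing transcripts, but the error class is easy to reproduce in ordinary code. A minimal Python sketch (the $47.50 bill, 18% tip, and `tip_total` helper are our own illustration, not items from the benchmark) shows why careful calculators round once, in decimal, at the end, while naive floating-point pipelines drift:

```python
from decimal import Decimal, ROUND_HALF_UP

def tip_total(bill: str, tip_pct: str) -> Decimal:
    """Return bill plus tip, rounded to the cent."""
    bill_d = Decimal(bill)
    tip = bill_d * Decimal(tip_pct) / Decimal("100")
    # Round exactly once, at the end; rounding intermediate values
    # is the kind of slip the benchmark flags most often.
    return (bill_d + tip).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

print(0.1 + 0.2)                 # 0.30000000000000004 -- binary floats drift
print(tip_total("47.50", "18"))  # 56.05
```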

🔎 AIs aren't failing advanced calculus; they're failing the math that runs our daily lives.

2. Key Findings at a Glance

  • AIs are still far from perfect at calculations. None of the models scored higher than 63%, meaning they still get about 4 out of 10 calculation problems wrong.
  • Gemini 2.5 Flash leads, with Grok 4 a close second. Gemini 2.5 Flash achieved the highest overall accuracy (63%), narrowly beating Grok 4 (62.8%).
  • DeepSeek V3.2 (52%) occupies the middle ground, performing significantly better than the lowest tier but still 10 percentage points behind the leaders. For ChatGPT-5 (49.4%) and Claude Sonnet 4.5 (45.2%), more than half of all answers were incorrect!
  • The most common errors are mechanical. Rounding issues (35%) and calculation mistakes (33%) dominate.
  • Pure math is easier than applied math. AI performed best at straightforward Mathematics & Conversions and Probability & Statistics, but struggled with applied problems in areas like Physics and Health & Sports. The challenge lies in "translating" a real-world situation into the right formula, and that's where the most significant errors happen (a worked finance example follows this list).
  • The largest performance gaps occur in Finance and Economics. In these domains, Grok and Gemini achieved accuracy rates between 70% and 80%. In stark contrast, the other three models (ChatGPT, Claude, and DeepSeek) frequently struggled to exceed 40% accuracy on the same problems.
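The "translation" failure is easiest to see in finance, where picking the wrong formula, not botching the arithmetic, sinks the answer. A hedged sketch (the $10,000 / 7% / 10-year figures are illustrative, not problems from the ORCA set): a model that reaches for linear growth reports 70% ROI, while the compound formula the situation actually calls for gives roughly 96.7%:

```python
def compound(principal: float, annual_rate_pct: float, years: int) -> float:
    """Compound growth: P * (1 + r) ** n."""
    return principal * (1 + annual_rate_pct / 100) ** years

def roi_pct(final_value: float, cost: float) -> float:
    """Simple ROI: net gain over cost, as a percentage."""
    return (final_value - cost) / cost * 100

value = compound(10_000, 7, 10)          # $10,000 at 7%/yr for 10 years
print(round(value, 2))                   # 19671.51
print(round(roi_pct(value, 10_000), 1))  # 96.7 -- not the linear 7 * 10 = 70
```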
...read more at omnicalculator.com