Report Highlights

  • ORCA Benchmark reveals you have a 40% chance of getting a wrong answer when you ask AI for everyday math.
  • Why AI chatbots give detailed, confident explanations for the wrong mathematical answers.
  • Why the biggest AI models are failing at basic, everyday math.
  • How a simple rounding error reveals the core limitation of large language models.
  • Why AI is an unreliable calculator for your finances and your health.

1. You're 40% Likely to Get a Wrong Answer When You Ask an AI for Everyday Math

That's the truth we uncovered after testing today's five leading AIs on 500 real-world problems.

From calculating a tip to projecting a business ROI, we're trusting AI with our most basic calculations. Our data reveals that trust could be dangerously misplaced.

The ORCA (Omni Research on Calculation in AI) Benchmark, a comprehensive test spanning finance, health, and physics, reveals that no AI model scored above 63%. The leader, Gemini, still gets nearly 4 out of 10 problems wrong. The most common culprits? Not complex logic, but simple rounding errors and calculation mistakes.
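The report doesn't publish its failing transcripts, but the error class is easy to reproduce in ordinary code. A minimal Python sketch (the $47.50 bill, 18% tip, and `tip_total` helper are our own illustration, not items from the benchmark) shows why careful calculators round once, in decimal, at the end, while naive floating-point pipelines drift:

```python
from decimal import Decimal, ROUND_HALF_UP

def tip_total(bill: str, tip_pct: str) -> Decimal:
    """Return bill plus tip, rounded to the cent."""
    bill_d = Decimal(bill)
    tip = bill_d * Decimal(tip_pct) / Decimal("100")
    # Round exactly once, at the end; rounding intermediate values
    # is the kind of slip the benchmark flags most often.
    return (bill_d + tip).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

print(0.1 + 0.2)                 # 0.30000000000000004 -- binary floats drift
print(tip_total("47.50", "18"))  # 56.05
```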

🔎 AIs aren't failing advanced calculus; they're failing the math that runs our daily lives.

2. Key Findings at a Glance

  • AIs are still far from perfect at calculations. None of the models scored higher than 63%, meaning they still get about 4 out of 10 calculation problems wrong.
  • Gemini 2.5 Flash leads, with Grok 4 a close second. Gemini 2.5 Flash achieved the highest overall accuracy (63%), narrowly beating Grok 4 (62.8%).
  • DeepSeek V3.2 (52%) occupies the middle ground, performing significantly better than the lowest tier but still 10 percentage points behind the leaders. For ChatGPT-5 (49.4%) and Claude Sonnet 4.5 (45.2%), more than half of all answers were incorrect!
  • The most common errors are mechanical. Rounding issues (35%) and calculation mistakes (33%) dominate.
  • Pure math is easier than applied math. AI performed best at straightforward Mathematics & Conversions and Probability & Statistics, but struggled with applied problems in areas like Physics and Health & Sports. The challenge lies in "translating" a real-world situation into the right formula, and that's where the most significant errors happen (a worked finance example follows this list).
  • The largest performance gaps occur in Finance and Economics. In these domains, Grok and Gemini achieved accuracy rates between 70% and 80%. In stark contrast, the other three models (ChatGPT, Claude, and DeepSeek) frequently struggled to exceed 40% accuracy on the same problems.
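The "translation" failure is easiest to see in finance, where picking the wrong formula, not botching the arithmetic, sinks the answer. A hedged sketch (the $10,000 / 7% / 10-year figures are illustrative, not problems from the ORCA set): a model that reaches for linear growth reports 70% ROI, while the compound formula the situation actually calls for gives roughly 96.7%:

```python
def compound(principal: float, annual_rate_pct: float, years: int) -> float:
    """Compound growth: P * (1 + r) ** n."""
    return principal * (1 + annual_rate_pct / 100) ** years

def roi_pct(final_value: float, cost: float) -> float:
    """Simple ROI: net gain over cost, as a percentage."""
    return (final_value - cost) / cost * 100

value = compound(10_000, 7, 10)          # $10,000 at 7%/yr for 10 years
print(round(value, 2))                   # 19671.51
print(round(roi_pct(value, 10_000), 1))  # 96.7 -- not the linear 7 * 10 = 70
```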
...read more at omnicalculator.com