
ORCA benchmark trips up ChatGPT-5, Gemini 2.5 Flash, Claude Sonnet 4.5, Grok 4, and DeepSeek V3.2
In the world of George Orwell's 1984, two and two make five. And large language models are not much better at math.
Though AI models have been trained to answer that sum correctly, and even to recognize "2 + 2 = 5" as a reference to the errant equation's use as a Party loyalty test in Orwell's dystopian novel, they still can't calculate reliably.
Scientists affiliated with Omni Calculator, a Poland-based maker of online calculators, and with universities in France, Germany, and Poland, devised a math benchmark called ORCA (Omni Research on Calculation in AI), which poses a series of math-oriented natural language questions in a wide variety of technical and scientific fields. Then they put five leading LLMs to the test.
ChatGPT-5, Gemini 2.5 Flash, Claude Sonnet 4.5, Grok 4, and DeepSeek V3.2 all scored a failing grade of 63 percent or less.
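For a sense of what scoring a benchmark like this might involve, here is a minimal Python sketch of a grader that marks a model's numeric answer correct if it falls within a relative tolerance of the ground truth. The questions, the 1 percent tolerance, and the stand-in model outputs are all invented for illustration; they are not taken from ORCA itself.

```python
import math

def grade(model_answer: float, truth: float, rel_tol: float = 0.01) -> bool:
    """Mark an answer correct if it is within rel_tol of the ground truth.

    The 1% relative tolerance is an assumption for this sketch, not
    ORCA's actual grading rule.
    """
    return math.isclose(model_answer, truth, rel_tol=rel_tol)

# Toy natural-language math questions with known ground truths (hypothetical).
questions = [
    ("Kinetic energy of a 2 kg mass at 3 m/s, in joules", 0.5 * 2 * 3**2),  # 9.0
    ("Monthly rate on a 5% annual rate, compounded monthly", 0.05 / 12),    # ~0.0041667
]

# Stand-ins for parsed model outputs; a real harness would extract these
# from the LLM's free-text response.
model_outputs = [9.0, 0.005]  # second answer is off by ~20%, so it fails

correct = sum(
    grade(out, truth) for (_, truth), out in zip(questions, model_outputs)
)
print(f"score: {correct}/{len(questions)} = {100 * correct / len(questions):.0f}%")
```

On this toy set the model lands at 50 percent, which is the flavor of result the ORCA authors report: the models get many answers roughly right, but not reliably enough to pass.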
Seems to resonate with Terence Tao's current opinions on AI in math. From memory, he likens it to a low-level PhD student who can do some of the drudge work but makes several mistakes and needs a lot of guidance. Couldn't find a specific source on this, so don't fully trust my memory.
Instead, I found this thread where he contextualizes the recent news about AI models obtaining gold-medal scores at the IMO. Tl;dr: don't always believe AI companies' claims (duh).
Well, seems like it's getting better (although not perfect) compared to what I remembered.
0 sats \ 0 replies \ @Atreus 5h
I've noticed this independently. I used to think LLMs would at least be good at calculations, but they hallucinate while number crunching too 🤷