
ORCA benchmark trips up ChatGPT-5, Gemini 2.5 Flash, Claude Sonnet 4.5, Grok 4, and DeepSeek V3.2
In the world of George Orwell's 1984, two and two make five. And large language models are not much better at math.
Though AI models have been trained to answer that sum correctly, and even to recognize "2 + 2 = 5" as a reference to the errant equation's use as a Party loyalty test in Orwell's dystopian novel, they still can't calculate reliably.
Scientists affiliated with Omni Calculator, a Poland-based maker of online calculators, and with universities in France, Germany, and Poland, devised a math benchmark called ORCA (Omni Research on Calculation in AI), which poses a series of math-oriented natural language questions in a wide variety of technical and scientific fields. Then they put five leading LLMs to the test.
ChatGPT-5, Gemini 2.5 Flash, Claude Sonnet 4.5, Grok 4, and DeepSeek V3.2 all scored a failing grade of 63 percent or less.
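For a sense of what scoring a benchmark like this might involve, here is a minimal Python sketch of a grader that marks a model's numeric answer correct if it falls within a relative tolerance of the ground truth. The questions, the 1 percent tolerance, and the stand-in model outputs are all invented for illustration; they are not taken from ORCA itself.

```python
import math

def grade(model_answer: float, truth: float, rel_tol: float = 0.01) -> bool:
    """Mark an answer correct if it is within rel_tol of the ground truth.

    The 1% relative tolerance is an assumption for this sketch, not
    ORCA's actual grading rule.
    """
    return math.isclose(model_answer, truth, rel_tol=rel_tol)

# Toy natural-language math questions with known ground truths (hypothetical).
questions = [
    ("Kinetic energy of a 2 kg mass at 3 m/s, in joules", 0.5 * 2 * 3**2),  # 9.0
    ("Monthly rate on a 5% annual rate, compounded monthly", 0.05 / 12),    # ~0.0041667
]

# Stand-ins for parsed model outputs; a real harness would extract these
# from the LLM's free-text response.
model_outputs = [9.0, 0.005]  # second answer is off by ~20%, so it fails

correct = sum(
    grade(out, truth) for (_, truth), out in zip(questions, model_outputs)
)
print(f"score: {correct}/{len(questions)} = {100 * correct / len(questions):.0f}%")
```

On this toy set the model lands at 50 percent, which is the flavor of result the ORCA authors report: the models get many answers roughly right, but not reliably enough to pass.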
Seems to resonate with Terence Tao's current opinions on AI in math. From memory, he likens it to a low-level PhD student who can do some of the drudge work but makes several mistakes and needs a lot of guidance. Couldn't find a specific source on this, so don't fully trust my memory.
Instead, I found this thread where he contextualizes the recent news about AI models obtaining gold-medal scores at the IMO. Tl;dr: don't always believe AI companies' claims (duh).
Well, seems like it's getting better (although not perfect) compared to what I remembered.
0 sats \ 0 replies \ @Atreus 5h
I've noticed this independently. I used to think LLMs would at least be good at calculations, but they hallucinate while number crunching too 🤷