Chain-of-Thought (CoT) prompting has been shown to improve Large Language Model (LLM) performance on various tasks. With this approach, LLMs appear to produce human-like reasoning steps before providing answers (a.k.a., CoT reasoning), which often leads to the perception that they engage in deliberate inferential processes. However, some initial findings suggest that CoT reasoning may be more superficial than it appears, motivating us to explore further.
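To make concrete what the prompting side of this looks like, here is a minimal sketch of a zero-shot CoT prompt next to a direct prompt. The helper names (build_direct_prompt, build_cot_prompt) and the prompt wording are illustrative assumptions, not anything prescribed by the paper.

    # Minimal sketch: the only difference between direct prompting and
    # zero-shot CoT prompting is an instruction that elicits intermediate
    # "reasoning" tokens before the final answer. Helper names and prompt
    # wording are illustrative, not taken from the paper.

    def build_direct_prompt(question: str) -> str:
        # Ask for the answer only, no intermediate steps.
        return f"{question}\nGive only the final answer."

    def build_cot_prompt(question: str) -> str:
        # Standard zero-shot CoT trigger phrase.
        return f"{question}\nLet's think step by step, then state the final answer."

    if __name__ == "__main__":
        q = "A train travels 60 km in 45 minutes. What is its average speed in km/h?"
        print(build_direct_prompt(q))
        print("---")
        print(build_cot_prompt(q))

The model's completion of the second prompt is what is meant by CoT reasoning here: the visible step-by-step text, not a claim about the underlying process.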
Our investigation, conducted through the controlled environment of DataAlchemy, reveals that the apparent reasoning prowess of Chain-of-Thought (CoT) is largely a brittle mirage. The findings across task, length, and format generalization experiments converge on a conclusion: CoT is not a mechanism for genuine logical inference but rather a sophisticated form of structured pattern matching, fundamentally bounded by the data distribution seen during training. When pushed even slightly beyond this distribution, its performance degrades significantly, exposing the superficial nature of the “reasoning” it produces.
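As a rough intuition for "structured pattern matching bounded by the training distribution," consider the toy sketch below. It is an assumption for illustration only, not DataAlchemy's actual tasks or protocol: a solver that memorizes answers for the exact (input, task) pairs seen in training looks flawless in-distribution and collapses the moment either the input or the composition of tasks is new.

    # Toy illustration (not the paper's DataAlchemy setup): a "solver"
    # that memorizes answers for the (input, task) pairs seen in training
    # fails on an unseen input or an unseen composition of the very same
    # primitive transformations.

    def rot(s: str, k: int) -> str:
        # Shift each lowercase letter k places down the alphabet.
        return "".join(chr((ord(c) - 97 + k) % 26 + 97) for c in s)

    def reverse(s: str) -> str:
        return s[::-1]

    TRAIN_TASKS = {"rot1": lambda s: rot(s, 1), "reverse": reverse}
    TRAIN_WORDS = ["cat", "dog", "mirage"]

    # "Training": memorize the answer for every (word, task) pair seen.
    memory = {(w, name): fn(w) for w in TRAIN_WORDS for name, fn in TRAIN_TASKS.items()}

    def solve(word: str, task: str) -> str:
        # Pure lookup: no notion of what rot or reverse actually do.
        return memory.get((word, task), "<no matching pattern>")

    print(solve("cat", "rot1"))          # in-distribution: "dbu"
    print(solve("zebra", "rot1"))        # unseen input: no pattern to match
    print(solve("cat", "rot1+reverse"))  # unseen composition: no pattern to match

An LLM interpolates far more gracefully than a lookup table, but the paper's claim is that the failure mode is of the same kind: performance degrades as soon as the probe moves beyond the training distribution.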
I wonder how many TWh have been wasted on reasoning token generation since it was introduced late last year.
I tried qwen3-coder the other day and it is only a little less efficient than claude 3.7, but claude 4 is actually quite good. I also tried gpt-oss-120b, which turned out awful at tool calls and almost completely non-comprehending when it came to deciding simple things, like that you can get the source of a file using the tools. This surprised me, but it got annoying to the point where I had to stop using it because it was just running into error loops. So yeah, that's a D- for OpenAI's "open source" model - they probably just nerfed it for the specific purpose of complaints like mine, so that they can continue sucking your moneys.
qwen3-235b-a22b-0725.
gpt-oss-20b was better than 120b... especially considering the speed difference when trying to run on local hardware. Not sure what they did to 120b, but I didn't get really good results from it.