121 sats \ 4 replies \ @freetx 10h \ on: Is Chain-of-Thought Reasoning of LLMs a Mirage? AI
Personally, when doing light programming tasks (Ansible tasks, bash scripts, Python one-offs), I deliberately choose non-thinking models, as I find the output to be better.
Lately I've been seeing local models like devstral and qwen3-coder that produce output comparable to previous "premier" models like Claude 3.5.
I really do think we are going to reach the point where local models are "good enough" for 80% of the routine tasks people need... and the AI hype craze will die down, as the remaining 20% that requires high-end datacenters won't be enough to sustain the enormous hype we've seen over the last few years.
The curious thing is that we probably do need these datacenters at the moment, primarily to help us train / finetune / distill the next generation of local open-source models.
Yes, I was testing qwen3-coder the other day and it is only a little less efficient than Claude 3.7, but Claude 4 is actually quite good.
This morning I tested gpt-oss-120b, which turned out awful at tool calls and almost completely non-comprehending when it came to figuring out simple things, like that it can get the source of a file using the tools. This surprised me, but it got annoying to the point where I had to stop using it because it was just running into error loops. So yeah, that's a D- for OpenAI's "open source" model. They probably just nerfed it, anticipating complaints like mine, so that they can keep sucking your money. However, instead of giving Scam Altman my hard-earned money (he defo does not deserve it), I'm going to test the mouthful that is qwen3-235b-a22b-0725.
reply
I found gpt-oss-20b better than 120b, especially considering the speed difference when trying to run on local hardware. Not sure what they did to 120b, but I didn't get really good results from it. 20b was OK; I even thought the Ansible task I gave it came out quite well. However, the "thinking" phase seemed excessive for the final output.
qwen3-235b-a22b-0725
Heard good things about that on the tubes... I'd be interested to hear what your findings are. I may try to download it today and test as well.
reply
re: qwen3. It just went into an endless "reasoning" loop; here's a word-frequency count of the aborted transcript:
% cat aborted.txt | sed 's/[\.,]/ /g' | tr " " "\n" | sed 's/\.$//g;s/^$/-----/g' | tr '[:upper:]' '[:lower:]' | grep -v -- "-----" | sort | uniq -c | sort -n
1 also
1 application
1 as
1 current
1 ensure
1 expected
1 implementation
1 involves
1 it
1 requested
1 running
1 should
1 test
1 this
1 triggers
1 update
1 user
1 verifying
1 works
83 by
83 category
83 compilation
83 correct
83 error
83 errors
83 fixed
83 from
83 have
83 structure
83 using
84 app
84 associated
84 be
84 but
84 can
84 component
84 createtoolbar
84 currently
84 defined
84 does
84 implement
84 not
84 now
84 same
84 statusicon
84 used
85 active
85 available
85 completion
85 correctly
85 database
85 enabled
85 fetches
85 for
85 has
85 icon
85 if
85 indicate
85 loading
85 message
85 processes
85 progress
85 results
85 saves
85 show
85 summary
86 button
86 operation
86 that
167 been
167 file
168 actual
168 logic
168 validation
169 with
170 a
170 all
170 articles
170 feeds
170 of
170 sets
251 go
252 implemented
253 refreshfeeds
254 bar
254 feed
254 updates
342 to
421 and
421 function
421 in
421 is
425 refresh
509 status
1861 the
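For anyone who wants to reproduce the count, here is essentially the same pipeline split out with comments. It's a sketch, not the exact command above: it folds the separate lowercase/cleanup steps together, and since the real aborted.txt isn't available here, a tiny stand-in file is created first.

```shell
# Stand-in for the real aborted.txt transcript (assumption: the original
# looping output is not available, so we use a two-sentence sample).
printf 'The status bar updates. The feed updates.\n' > aborted.txt

# Word-frequency count, same idea as the one-liner above:
sed 's/[.,]/ /g' aborted.txt |   # turn periods and commas into spaces
  tr ' ' '\n' |                  # one token per line
  tr '[:upper:]' '[:lower:]' |   # case-fold everything
  grep -v '^$' |                 # drop empty lines left by the split
  sort | uniq -c | sort -n       # count duplicates, sort by frequency
```

With the sample input this prints five lines, ending with the most frequent tokens ("the" and "updates", twice each), which is how the 1861 occurrences of "the" above make the looping obvious at a glance.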
reply
Not sure what they did to 120b, but I didn't get really good results from it.
I think it's an attention-span issue: when a larger context is provided, for example after inspecting a file or even listing a directory, it "forgets" about the tools, which are (often) provided at the end of the system prompt. It did do a couple of diffs successfully, but as the context window grew it just messed up more and more, even when I soft-capped it at 16k. So yeah... that one needs work.
I'll check out their 20b off some provider later, because I'm getting a bit tired of the massive downloads to my cloud inference node only to find out something sucks. Probably cheaper to spend 2k sats on a paid API.
reply