
Interesting how they claim to boost accuracy over the most accurate model they use, just by mixing models?
I've been trying something similar in Roo Code, where I let Claude do the architecture and use self-hosted models for everything else. qwen3-coder isn't as good as claude-4-sonnet at coding, but it's still decent enough to let it slug it out.
I've been trying to build a special "hard-problem debug" mode, but I've put that on hold since I haven't found a single model (including all of the commercial closed ones) that can fix concurrency issues without constant manual intervention. But this makes me think: if I can alternate between a model that's good at determination, and let it guide / judge a coder model... this may work?
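Something like this loop is what I have in mind — a rough sketch only, with `judge()` and `code()` as hypothetical stand-ins for the two model calls (not any real API):

```python
# Hypothetical sketch: alternate a strong "judge" model with a cheaper coder model.
# judge() and code() are stubs standing in for real model calls.

def judge(task, attempt):
    """Stub for the strong model: return feedback, or None if the attempt passes."""
    return None if "fixed" in attempt else "still racy, add locking"

def code(task, feedback):
    """Stub for the coder model: produce a new attempt from the feedback."""
    return f"attempt addressing: {feedback}" if feedback else "fixed: initial attempt"

def debug_loop(task, max_rounds=5):
    attempt = code(task, None)
    for _ in range(max_rounds):
        feedback = judge(task, attempt)
        if feedback is None:   # judge accepts the attempt, we're done
            return attempt
        attempt = code(task, feedback)
    return attempt             # give up after max_rounds and return the last try
```

The point is that the expensive model only ever reads and critiques, while the cheap one does all the generation.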
I wonder if it's related to this phenomenon, where running it 2 times gives much better results: https://brooker.co.za/blog/2012/01/17/two-random.html
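The effect from the blog post is easy to reproduce — here's a small simulation (my own sketch, not from the post) where each task joins the shortest of `choices` randomly sampled queues:

```python
import random

def max_load(servers=100, tasks=10000, choices=1, seed=0):
    """Assign tasks to servers; each task samples `choices` random servers
    and joins the one with the shortest queue.
    Returns the maximum queue length (lower = better balanced)."""
    rng = random.Random(seed)
    load = [0] * servers
    for _ in range(tasks):
        candidates = [rng.randrange(servers) for _ in range(choices)]
        best = min(candidates, key=lambda s: load[s])
        load[best] += 1
    return max(load)
```

With a fixed seed, going from one random choice to the best of two flattens the maximum queue length dramatically, while a third choice helps only marginally — which is exactly the "two random choices" result.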
I read that as: when you gamble twice, you have a greater chance of winning at least once. lol
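That's basically it — if one run succeeds with probability p, then at least one of two independent runs succeeds with probability 1 - (1 - p)^2:

```python
def at_least_one_success(p, runs=2):
    """Probability that at least one of `runs` independent attempts succeeds,
    assuming each attempt succeeds with probability p."""
    return 1 - (1 - p) ** runs

# e.g. a model that's right 60% of the time, run twice:
# at_least_one_success(0.6)  ->  approximately 0.84
```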
I think the big thing is this: pure random is the same no matter the delay; picking the fastest of 2 is much better than random, and only a little worse than picking from 3. It seems similar to this strategy: https://www.tiktok.com/t/ZP8BD7XQp/
bro made me watch tiktok!
It's a good theory, but the reason I say "gamble" is the randomization going on in sampling, even in MoE models, where it's been reduced a lot. I'm not sure how this works in gpt-5 or claude-4 though, so maybe that's worth testing too.