Interesting how they claim to boost accuracy over the most accurate model they use, just by mixing models?
I've been trying something similar on Roo Code, where I let Claude do the architecture and use self-hosted models for everything else. qwen3-coder isn't as good as claude-4-sonnet in coding, but it's still decent enough to let it slug it out.
I've been trying to build a special "hard-problem debug" mode, but since I haven't found a single model that can fix concurrency issues without constant manual intervention (including all of the commercial closed models), I've put that on hold. But this makes me think: if I take a model that's good at staying determined and let it guide / judge a coder model... this might work?
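Roughly the loop I'm imagining, as a sketch. Everything here (the `call_model` helper, the model names, the prompts) is a placeholder for however you wire up your backends, not any particular framework's API:

```python
def call_model(model: str, prompt: str) -> str:
    # Placeholder: wire this to your inference endpoint (OpenAI-compatible,
    # Ollama, etc.).
    raise NotImplementedError

def debug_loop(bug_report: str, max_rounds: int = 5) -> str:
    """Alternate a 'judge' model that diagnoses and reviews with a
    'coder' model that writes the actual patch."""
    context = bug_report
    patch = ""
    for _ in range(max_rounds):
        # Judge model: diagnose and propose one concrete step, no code.
        plan = call_model(
            "judge-model",
            f"Diagnose this concurrency bug and give one concrete next step:\n{context}",
        )
        # Coder model: implement exactly the step the judge proposed.
        patch = call_model(
            "coder-model",
            f"Apply this fix plan as a patch:\n{plan}\n\nContext:\n{context}",
        )
        # Judge model again: accept, or send back with a critique.
        verdict = call_model(
            "judge-model",
            f"Does this patch actually fix the race? Answer ACCEPT or critique:\n{patch}",
        )
        if verdict.strip().startswith("ACCEPT"):
            break
        context = f"{context}\n\nRejected patch:\n{patch}\n\nCritique:\n{verdict}"
    return patch
```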
I think the big thing is this: picking at random performs the same no matter the delay, while picking the fastest of 2 is much better than random and not that much worse than picking the fastest of 3. It seems similar to this strategy: https://www.tiktok.com/t/ZP8BD7XQp/
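A toy simulation of that intuition; the exponential latency distribution with mean 1.0 is just an assumption for illustration, not anything measured:

```python
import random
import statistics

# Race k requests and keep the fastest. With exponential latencies,
# E[min of k] = 1/k, so the jump from k=1 to k=2 halves the mean latency,
# while k=2 to k=3 only shaves off another sixth.
def fastest_of(k: int, trials: int = 100_000) -> float:
    return statistics.mean(
        min(random.expovariate(1.0) for _ in range(k)) for _ in range(trials)
    )

for k in (1, 2, 3):
    print(f"k={k}: mean latency ~ {fastest_of(k):.3f}")
# Expected roughly: k=1 ~ 1.000, k=2 ~ 0.500, k=3 ~ 0.333
```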
It's a good theory, but the reason I say gamble is the randomization going on, even in MoE, where it's been reduced a lot. I'm not sure how this works in gpt-5 or claude-4 though, so maybe that's worth testing too.