Also, these things are fairly useless without a good harness; the popular models all seem to have good wrappers in common... need to see Meta put this *in* something

I like how Fireship refers to these as "Trust-me-bro" benchmarks. This table, showing how mixed the results are across benchmarks and models, kind of illustrates how noisy they are... crushing Opus in some, behind G3.1 and even Grok in others (been rooting for Grok, but it's really due for an upgrade)

Gonna have to defer to the Trust @optimism benchmark whenever that gets published

justin_shocknet

This is their first closed-weight model afaik. It's pretty competitive outside of abstract reasoning and agentic tasks, according to the benchmarks.

charts_and_maps

![](https://m.stacker.news/137322)