This is their first closed-weight model afaik. It's pretty competitive outside of abstract reasoning and agentic tasks, according to the benchmarks.
I like how Fireship refers to these as "Trust-me-bro" benchmarks. This table, showing how mixed the results are across benchmarks and models, kind of illustrates how noisy they are: crushing Opus in some, behind G3.1 and even Grok in others (been rooting for Grok, but it's really due for an upgrade).
Gonna have to defer to the Trust @optimism benchmark whenever that gets published
Also, these things are fairly useless without a good harness; the popular models seem to have good wrappers in common... need to see Meta put this in something.