Also, these things are fairly useless without a good harness. The popular models all seem to have good wrappers in common... need to see Meta put this in something.
I like how Fireship refers to these as "Trust-me-bro" benchmarks. This table, showing how mixed the results are across the different benchmarks and models, kind of illustrates how noisy they are... crushing Opus in some, behind G3.1 and even Grok in others (been rooting for Grok, but it's really due for an upgrade).
Gonna have to defer to the Trust @optimism benchmark whenever that gets published