
I was looking at these κ charts and wondered whether .38 for higher complexity and .56 for lower complexity is a great result, if human experts reach .78 and .81 among themselves?
I knew I'd seen a paper about this: https://arxiv.org/abs/2501.08167, but it's kinda stone age:
| Comparisons | Percentage Agreement | Cohen's Kappa |
|---|---|---|
| Human vs Claude 2.1 Ratings | 79% | 0.41 |
| Human vs Titan Express Ratings | 78% | 0.35 |
| Human vs Sonnet 3.5 Ratings | 76% | 0.44 |
| Human vs Llama 3.3 70b Ratings | 79% | 0.39 |
| Human vs Nova Pro | 76% | 0.34 |
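
For anyone wondering how 79% agreement can sit next to a kappa of only 0.41: Cohen's kappa subtracts the agreement you'd expect by chance, κ = (p_o − p_e) / (1 − p_e). Here's a rough sketch that backs out the chance agreement p_e implied by each row. It assumes the paper reports plain unweighted Cohen's kappa computed on the same ratings as the percentage column, and the table values are rounded, so treat the results as ballpark only. The function names are mine, not from the paper.

```python
def cohen_kappa(p_observed: float, p_expected: float) -> float:
    """Unweighted Cohen's kappa from observed and chance agreement."""
    return (p_observed - p_expected) / (1.0 - p_expected)


def implied_chance_agreement(p_observed: float, kappa: float) -> float:
    """Solve kappa = (p_o - p_e) / (1 - p_e) for p_e."""
    return (p_observed - kappa) / (1.0 - kappa)


# (percentage agreement, kappa) copied from the table above; values are rounded
rows = {
    "Claude 2.1": (0.79, 0.41),
    "Titan Express": (0.78, 0.35),
    "Sonnet 3.5": (0.76, 0.44),
    "Llama 3.3 70b": (0.79, 0.39),
    "Nova Pro": (0.76, 0.34),
}

for name, (p_o, kappa) in rows.items():
    p_e = implied_chance_agreement(p_o, kappa)
    print(f"{name:15s} p_o={p_o:.2f} kappa={kappa:.2f} implied p_e={p_e:.2f}")
```

For the first row that works out to a chance agreement of roughly 0.64, which would mean the ratings are lopsided enough that two raters agree most of the time without even trying. That's why the percentage column looks much better than the kappa column.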
That looks awesome considering Google's results came from a 3.25B model, but the evaluation data provided in the paper was "a mockup", so we don't know if this is apples-to-apples. Nevertheless, I'm a big fan of "less junk in".