I was looking at these k charts and wondered whether .38 for higher complexity and .56 for lower complexity is a great result, if human experts reach .78 and .81 among themselves?
I knew I'd seen a paper about this: https://arxiv.org/abs/2501.08167, but it's kinda stone age:
| Comparisons | Percentage Agreement | Cohen's Kappa |
|---|---|---|
| Human vs Claude 2.1 Ratings | 79% | 0.41 |
| Human vs Titan Express Ratings | 78% | 0.35 |
| Human vs Sonnet 3.5 Ratings | 76% | 0.44 |
| Human vs Llama 3.3 70b Ratings | 79% | 0.39 |
| Human vs Nova Pro | 76% | 0.34 |
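For anyone wondering how ~79% agreement can sit next to a kappa of only ~0.4: kappa subtracts the agreement you'd expect by chance, so lopsided label distributions drag it down. Here's a minimal sketch of the standard Cohen's kappa formula. The ratings are made-up toy data (not from the paper), and `cohens_kappa` is just my own helper name.

```python
from collections import Counter


def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length lists of categorical labels."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(a)
    # Observed agreement: share of items where both raters gave the same label.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in set(count_a) | set(count_b)) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        return float("nan")
    return (p_o - p_e) / (1 - p_e)


# Toy data (hypothetical): 100 items, mostly "pass", 79 agreements in total.
pairs = (
    [("pass", "pass")] * 70
    + [("pass", "fail")] * 10
    + [("fail", "pass")] * 11
    + [("fail", "fail")] * 9
)
human = [h for h, _ in pairs]
model = [m for _, m in pairs]

agreement = sum(h == m for h, m in pairs) / len(pairs)
kappa = cohens_kappa(human, model)
print(f"agreement={agreement:.2f} kappa={kappa:.2f}")
```

With these toy numbers, raw agreement is 79% but kappa comes out much lower, because two raters who both say "pass" about 80% of the time would already agree on roughly 69% of items by chance.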
Looks awesome if we realize that Google's results were with a 3.25B model, but the evaluation data provided in the paper was "a mockup", so we don't know if this is apples-to-apples. Nevertheless, I'm a big fan of "less junk in".