Achieving 10,000x training data reduction with high-fidelity labels

carter

I was looking at these `k` charts

![](https://m.stacker.news/103947)

and wondered: is .38 for higher complexity and .56 for lower complexity a great result, if human experts reach .78 and .81 among themselves?

I knew I'd seen a paper about this: https://arxiv.org/abs/2501.08167, but it's kinda stone age:

> | Comparisons                            | Percentage Agreement | Cohen’s Kappa |
> | -------------------------------------- | -------------------- | ------------- |
> | Human vs Claude 2.1 Ratings            | 79%                  | 0.41          |
> | Human vs Titan Express Ratings         | 78%                  | 0.35          |
> | Human vs Sonnet 3.5 Ratings            | 76%                  | 0.44          |
> | Human vs Llama 3.3 70b Ratings         | 79%                  | 0.39          |
> | Human vs Nova Pro                      | 76%                  | 0.34          |

Next to those numbers, Google's results look awesome, especially considering they came from a 3.25B model. But the evaluation data provided in the paper was "a mockup", so we don't know if this is apples-to-apples. Nevertheless, I'm a big fan of "less junk in".
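
For anyone wondering why ~79% agreement only buys a kappa around 0.4: Cohen's kappa subtracts the agreement two raters would reach by chance, so skewed label distributions pull it down hard. Here is a minimal sketch of the standard formula, kappa = (p_o - p_e) / (1 - p_e). The `human` and `model` ratings are made up for illustration and are not data from either paper, and the function name `cohens_kappa` is my own.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(rater_a)

    # Observed agreement: share of items where both raters chose the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: probability that both pick the same label independently,
    # given each rater's own label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if p_e == 1.0:
        raise ValueError("kappa is undefined when both raters use a single label")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical ratings for 100 items (NOT taken from the papers above).
# Both raters say "ok" 80% of the time, and they agree on 80 of the 100 items.
human = ["ok"] * 80 + ["bad"] * 20
model = ["ok"] * 70 + ["bad"] * 10 + ["ok"] * 10 + ["bad"] * 10

agreement = sum(h == m for h, m in zip(human, model)) / len(human)
kappa = cohens_kappa(human, model)
print(f"agreement={agreement:.2f} kappa={kappa:.3f}")
```

With these made-up labels, chance agreement alone is already 0.68 (0.8 × 0.8 + 0.2 × 0.2), so 80% raw agreement works out to a kappa of only 0.375. That is roughly the pattern in the table above, assuming the labels there are similarly skewed, which I have not verified.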
