It seems like they are using existing statistical techniques to filter the data down to the examples that will be most impactful for training, and then pull good examples from each group... Very cool system.
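As a rough sketch of what that kind of filtering could look like (this is my guess, not the paper's method; the `topic` grouping key and the `score` function are made up stand-ins for whatever statistical impact metric they actually use):

```python
import random
from collections import defaultdict

def filter_training_data(examples, score, n_per_group=2):
    """Group examples, then keep only the highest-scoring ones per group.

    `score` is a hypothetical stand-in for the paper's impact metric;
    higher is assumed to mean "more useful for training".
    """
    groups = defaultdict(list)
    for ex in examples:
        groups[ex["topic"]].append(ex)  # "topic" is an assumed grouping key
    kept = []
    for members in groups.values():
        members.sort(key=score, reverse=True)  # best examples first
        kept.extend(members[:n_per_group])     # pull the top few per group
    return kept

# Toy data: "quality" stands in for a learned impact estimate.
random.seed(0)
data = [{"topic": t, "text": f"{t}-{i}", "quality": random.random()}
        for t in ("math", "code") for i in range(5)]
best = filter_training_data(data, score=lambda ex: ex["quality"])
print(len(best))  # 2 kept per topic group -> 4
```

The interesting part in the real system is presumably the scoring, not the selection loop.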
Looks awesome, though keep in mind that Google's results were with a 3.25B model, and the evaluation data provided in the paper was "a mockup", so we don't know if this is apples-to-apples. Nevertheless, I'm a big fan of "less junk in".
Another one of these and I can train the next frontier model on my RPi2B!
I was looking at these charts and wondered: is .38 for higher complexity and .56 for lower complexity a great result, if human experts reach .78 and .81 among themselves?
I knew I'd seen a paper about this: https://arxiv.org/abs/2501.08167, but it's kinda stone age: