Large language models are the unintended byproduct of about three decades' worth of freely accessible human text online. Ilya Sutskever compared this reservoir of information to fossil fuel: abundant but ultimately finite. Some studies suggest that, at current token‑consumption rates, frontier labs could exhaust the highest‑quality English web text well before the decade ends. Even if those projections prove overly pessimistic, one fact is clear: today's models consume data far faster than humans can produce it.
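The "exhaust before the decade ends" claim is easy to sanity-check with back-of-envelope arithmetic. The sketch below is purely illustrative: the stock size, annual consumption, and growth rate are assumed placeholder numbers, not figures from the studies mentioned above.

```python
# Back-of-envelope estimate of when a fixed stock of text runs out.
# All numbers are illustrative assumptions, not measured figures.

def years_until_exhaustion(stock_tokens, tokens_per_year, growth_rate):
    """Years until cumulative consumption exceeds the fixed stock,
    with annual consumption growing by `growth_rate` per year."""
    consumed, years = 0.0, 0
    rate = tokens_per_year
    while consumed < stock_tokens:
        consumed += rate
        rate *= 1 + growth_rate
        years += 1
    return years

# Assumed: ~300T tokens of high-quality text, ~50T tokens/year consumed,
# demand roughly doubling every two years (~41%/yr growth).
print(years_until_exhaustion(300e12, 50e12, 0.41))  # → 4
```

The point of the exercise is that exponential growth in demand against a fixed stock makes the exhaustion date fairly insensitive to the exact stock size: doubling the stock buys only about one more doubling period.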
0 sats \ 0 replies \ @Scoresby 10h
I read through the rest of the paper and I have to admit I still don't understand this line. If pretraining is walking LLMs through huge chunks of human-created text and giving feedback about the quality of the LLM's outputs, I don't see why we can't repeatedly use the same data (as long as it's pretty huge, as in all the English-language writing on the internet). Maybe it's that to make improvements LLMs need ever-larger data sets, and we're getting to the point where human-generated data sets can't grow quickly enough (in that sense they have been "consumed"). But I'm still struggling with how to think about a model "consuming" data.