I wasn't aware of OpenAI's training set coming from Common Crawl:
Then in the early 2020s, Common Crawl got a new user. Programmers at a company called OpenAI took the database and fed all those petabytes into their new program. They used other databases too. Government patent records. Academic papers. Hungrily, their program consumed the words, and learnt from the words. It became ChatGPT.
“For the first 15 years of its existence,” Common Crawl’s director said in 2023, “[it] has kind of been a sleepy project”. Then the AI companies came, “and we’re all kind of like, ‘Oh my God,’ you know, ‘What have we done here?’”
“The idea that you could distil all culture and put it in a machine and reconstruct it is completely new,” said Neil Lawrence, professor of machine learning at the University of Cambridge. “It is new in the way the printing press was new.” And new technology begets new law. “There was no need for copyright when monks hand-transcribed things because the effort of transcription was as much as the effort of creation.”
I don't think we can or would want to go back now.
Smith is pleased we are now discussing this. But, he says, we need to realise the “original sin has happened … The internet has already been consumed by these models”. His company is finding ways to create a marketplace in data that isn’t available on the open internet. Even so, when an AI has the entire corpus of English literature, there is only marginal value to getting the new Sally Rooney.
Lawrence agrees that, whatever is decided, there’s an element of “bolting the stable door”. Already, the AI companies are exploring new techniques, he said, such as reasoning models that can think in stages. “They’re moving on from this problem. While the rest of society is thinking, ‘What the hell?’”