Freeing factual information from copyright restrictions can help more researchers, educators, and language models access and share facts from scholarly work. As a result, researchers worldwide can discuss and build on one another’s findings without legal uncertainty or relying on expensive paywalls. We believe this is a key step toward a more open and inclusive global research community. We present a clear vision, a practical mechanism, and an open-source infrastructure aimed at fostering a more inclusive and collaborative global scientific ecosystem.
Great goal too!
Looking at their proposal, it's close to what I tried doing with an LLM + spaCy as a sort of knowledge base, where I basically followed this process:

1. Deep research with search tools (ddg-mcp)
2. For each article or document found, extract key facts (I found it hard to get meaningful results out of spaCy for this step; maybe I should replace spaCy with a purpose-trained LLM! A rough sketch is below the list.)
3. Add the key facts to a graph database (a custom implementation, but similar to memory-mcp)
4. Continuously look up facts from the MCP server while continuing the "research", to weigh the importance of new articles.
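For what it's worth, here's roughly what steps 2 and 3 looked like. This is only a minimal sketch under my own assumptions (spaCy's small English model, naive subject-verb-object extraction, a plain dict standing in for the graph store; the names are mine, not from any of the MCP servers mentioned above):

```python
# Minimal sketch of steps 2-3: naive subject-verb-object "fact" extraction
# with spaCy, stored in a toy in-memory graph. A real setup would talk to an
# MCP server (something like memory-mcp) instead of a local dict.
from collections import defaultdict

import spacy

nlp = spacy.load("en_core_web_sm")  # python -m spacy download en_core_web_sm

# adjacency list: subject -> list of (relation, object) edges
graph: dict[str, list[tuple[str, str]]] = defaultdict(list)

def extract_facts(text: str) -> list[tuple[str, str, str]]:
    """Pull rough (subject, verb, object) triples out of free text."""
    facts = []
    for sent in nlp(text).sents:
        for tok in sent:
            if tok.pos_ != "VERB":
                continue
            subjects = [c for c in tok.children if c.dep_ in ("nsubj", "nsubjpass")]
            objects = [c for c in tok.children if c.dep_ in ("dobj", "attr")]
            for s in subjects:
                for o in objects:
                    facts.append((s.lemma_.lower(), tok.lemma_.lower(), o.lemma_.lower()))
    return facts

def add_to_graph(facts: list[tuple[str, str, str]]) -> None:
    for subj, rel, obj in facts:
        graph[subj].append((rel, obj))

add_to_graph(extract_facts("Aspirin inhibits platelet aggregation."))
print(dict(graph))  # e.g. {'aspirin': [('inhibit', 'aggregation')]}
```

Even on a trivial sentence the extracted triple drops the "platelet" modifier, which is exactly the kind of detail a fact store can't afford to lose.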
But the problem I found is that if you omit the exact text, a generic LLM fed only the extracted facts (much like their definition of a knowledge unit) has a higher chance of hallucinating than one fed the original, unprocessed text. I suspect this is down to how the generic LLM is trained, and I think it's also what causes the spread between the lower and upper bounds in their evaluation:
| Model | Physics [Lower–Upper] | Physics KU | Medical [Lower–Upper] | Medical KU |
|---|---|---|---|---|
| Gemini (1.5-Flash 002) | 49.48–90.72 | 83.51 | 46.96–94.13 | 81.76 |
| Qwen 2.5 (7B) | 52.23–89.69 | 79.04 | 50.45–93.24 | 88.29 |
| Mistral Small (Dense 22B) | 50.86–89.35 | 81.44 | 48.31–94.59 | 90.20 |
They say the same about using embeddings (instead of the NLP+graph implementation that I used):
Embeddings often fail to capture precise factual details, making them unreliable technically for preserving scientific knowledge. Similarly, simple paraphrasing may still resemble the original text’s structure and style too closely, raising potential legality concerns.
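This matches what I saw. A quick way to convince yourself: embed two sentences that differ only in one critical number and compare them. A minimal sketch, assuming sentence-transformers and the all-MiniLM-L6-v2 model (my choices, not anything from the paper):

```python
# Two "facts" that differ only in a number still embed almost identically,
# which is the failure mode the paper is pointing at.
from sentence_transformers import SentenceTransformer
from numpy import dot
from numpy.linalg import norm

model = SentenceTransformer("all-MiniLM-L6-v2")
a = "The drug reduced mortality by 12% in the treatment group."
b = "The drug reduced mortality by 21% in the treatment group."

ea, eb = model.encode([a, b])
cosine = dot(ea, eb) / (norm(ea) * norm(eb))
print(f"cosine similarity: {cosine:.3f}")  # typically very close to 1.0
```

The two statements contradict each other, yet a retrieval layer built on embedding similarity alone can barely tell them apart.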
I'm not sure it is feasible (let alone desirable) to try to achieve copyright compliance if those who ignore it will ultimately produce better results. That would simply shift leadership to jurisdictions where one can get away with breaking copyright law, perhaps not unlike how industrial capacity shifted.
I think the bigger model-making companies will eventually reach some kind of agreement with the large scientific publishers so that their models are legally defensible. Those publishers stand to make serious money: to keep them satisfied, the deals will probably be sizeable, news-making in dollar terms.
I'm not saying it's the best model, just that I think it's likely. The big publishing companies love being the gatekeepers, since it protects their revenues; the knowledge is already gatekept today, just by the publishers. The push towards more capable LLMs is strong. I think OpenAI and the other big firms will make some kind of deal for their AIs to legally access scientific journals, and whether it's a lump sum or a per-usage deal, it's probably going to mean billions of dollars flowing to scientific publishers.
Also, based on a skim of current publisher policies I did the other day, many publishers allow not-for-profit scanning and indexing for machine-learning purposes. So there will likely be open-source efforts that don't get sued, as long as they open-source their models and don't charge for them as products.
"make some kind of deal for their AIs to legally access scientific journals"
That has already happened, which I think makes it all the more important to figure out how to remove the gatekeepers from the knowledge supply chain.
If you disrupt knowledge (that's what AGI is supposed to do, right?) only to insert yourself as a middleman, you're going to be in trouble. Especially since the rest of the world has working AI to help them undermine that position (even if it's mediocre, that just means it takes longer).
I found this while looking into how AI companies plan to use paywalled research to train their LLMs and other foundation models. The authors' "knowledge units" remind me of semantic web efforts with triple stores.
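To make the comparison concrete, here's what one of those facts looks like as plain triples; just a sketch with rdflib, where the vocabulary and URIs are made up for illustration:

```python
# One "knowledge unit" expressed semantic-web style, as RDF triples.
# The ex: vocabulary and the paper URI are hypothetical.
from rdflib import Graph, Literal, Namespace, URIRef

EX = Namespace("http://example.org/ku/")
g = Graph()
g.bind("ex", EX)

paper = URIRef("http://example.org/paper/123")  # hypothetical source document
g.add((EX.finding1, EX.subject, Literal("aspirin")))
g.add((EX.finding1, EX.relation, Literal("inhibits")))
g.add((EX.finding1, EX.object, Literal("platelet aggregation")))
g.add((EX.finding1, EX.derivedFrom, paper))

print(g.serialize(format="turtle"))  # reusable facts, no original prose attached
```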