Freeing factual information from copyright restrictions can help more researchers, educators, and language models access and share facts from scholarly work. As a result, researchers worldwide can discuss and build on one another’s findings without legal uncertainty or relying on expensive paywalls. We believe this is a key step toward a more open and inclusive global research community. We present a clear vision, a practical mechanism, and an open-source infrastructure aimed at fostering a more inclusive and collaborative global scientific ecosystem.
Great goal too!
Looking at their proposal, it's close to what I tried doing with an LLM + spaCy as a sort of knowledge base, where I basically followed this process:

1. Deep research with search tools (ddg-mcp)
2. For each article or document found, extract key facts (I found it hard to get meaningful results out of spaCy for this step; maybe I should replace spaCy with a purpose-trained LLM! A rough sketch is below the list.)
3. Add the key facts to a graph database (a custom implementation, but similar to memory-mcp)
4. Continuously look up facts from the MCP server while continuing the "research", to weigh the importance of new articles.
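For what it's worth, here's roughly what steps 2 and 3 looked like. This is only a minimal sketch under my own assumptions (spaCy's small English model, naive subject-verb-object extraction, a plain dict standing in for the graph store; the names are mine, not from any of the MCP servers mentioned above):

```python
# Minimal sketch of steps 2-3: naive subject-verb-object "fact" extraction
# with spaCy, stored in a toy in-memory graph. A real setup would talk to an
# MCP server (something like memory-mcp) instead of a local dict.
from collections import defaultdict

import spacy

nlp = spacy.load("en_core_web_sm")  # python -m spacy download en_core_web_sm

# adjacency list: subject -> list of (relation, object) edges
graph: dict[str, list[tuple[str, str]]] = defaultdict(list)

def extract_facts(text: str) -> list[tuple[str, str, str]]:
    """Pull rough (subject, verb, object) triples out of free text."""
    facts = []
    for sent in nlp(text).sents:
        for tok in sent:
            if tok.pos_ != "VERB":
                continue
            subjects = [c for c in tok.children if c.dep_ in ("nsubj", "nsubjpass")]
            objects = [c for c in tok.children if c.dep_ in ("dobj", "attr")]
            for s in subjects:
                for o in objects:
                    facts.append((s.lemma_.lower(), tok.lemma_.lower(), o.lemma_.lower()))
    return facts

def add_to_graph(facts: list[tuple[str, str, str]]) -> None:
    for subj, rel, obj in facts:
        graph[subj].append((rel, obj))

add_to_graph(extract_facts("Aspirin inhibits platelet aggregation."))
print(dict(graph))  # e.g. {'aspirin': [('inhibit', 'aggregation')]}
```

Even on a trivial sentence the extracted triple drops the "platelet" modifier, which is exactly the kind of detail a fact store can't afford to lose.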
But the problem I found is that if you omit the exact text, a generic LLM fed only the extracted facts (much like their definition of a knowledge unit) has a higher chance of hallucinating than one fed the original, unprocessed text. I suspect this is down to how the generic LLM is trained, and I think it's also what causes the spread between the lower and upper bounds in their evaluation:
| Model | Physics [Lower–Upper] | Physics KU | Medical [Lower–Upper] | Medical KU |
|---|---|---|---|---|
| Gemini (1.5-Flash 002) | 49.48–90.72 | 83.51 | 46.96–94.13 | 81.76 |
| Qwen 2.5 (7B) | 52.23–89.69 | 79.04 | 50.45–93.24 | 88.29 |
| Mistral Small (Dense 22B) | 50.86–89.35 | 81.44 | 48.31–94.59 | 90.20 |
They say the same about using embeddings (instead of the NLP+graph implementation that I used):
Embeddings often fail to capture precise factual details, making them unreliable technically for preserving scientific knowledge. Similarly, simple paraphrasing may still resemble the original text’s structure and style too closely, raising potential legality concerns.
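This matches what I saw. A quick way to convince yourself: embed two sentences that differ only in one critical number and compare them. A minimal sketch, assuming sentence-transformers and the all-MiniLM-L6-v2 model (my choices, not anything from the paper):

```python
# Two "facts" that differ only in a number still embed almost identically,
# which is the failure mode the paper is pointing at.
from sentence_transformers import SentenceTransformer
from numpy import dot
from numpy.linalg import norm

model = SentenceTransformer("all-MiniLM-L6-v2")
a = "The drug reduced mortality by 12% in the treatment group."
b = "The drug reduced mortality by 21% in the treatment group."

ea, eb = model.encode([a, b])
cosine = dot(ea, eb) / (norm(ea) * norm(eb))
print(f"cosine similarity: {cosine:.3f}")  # typically very close to 1.0
```

The two statements contradict each other, yet a retrieval layer built on embedding similarity alone can barely tell them apart.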
I'm not sure it is feasible (let alone desirable) to try to achieve copyright compliance if those who ignore it will ultimately produce better results. That would simply shift leadership to jurisdictions where one can get away with breaking copyright law, perhaps not unlike how industrial capacity shifted.
I think the bigger model-making companies will eventually reach some kind of agreement with the large scientific publishers so that their models are legally defensible. Those publishers stand to make serious money: to keep them satisfied, the deals will probably be sizeable, news-making in dollar terms.
I'm not saying it's the best model, just that I think it's likely. The big publishing companies love being the gatekeepers, since it protects their revenues; the knowledge is already gatekept today, just by the publishers. The push towards more capable LLMs is strong. I think OpenAI and the other big firms will make some kind of deal for their AIs to legally access scientific journals, and whether it's a lump sum or a per-usage deal, it's probably going to mean billions of dollars flowing to scientific publishers.
Also, based on a skim of current publisher policies I did the other day, many publishers allow not-for-profit scanning and indexing for machine-learning purposes. So there will likely be open-source efforts that don't get sued, as long as they open-source their models and don't charge for them as products.
"make some kind of deal for their AIs to legally access scientific journals"
That has already happened, which I think makes it all the more important to figure out how to remove the gatekeepers from the knowledge supply chain.
If you disrupt knowledge (that's what AGI is supposed to do, right?) only to insert yourself as a middleman, you're going to be in trouble. Especially since the rest of the world has working AI to help them undermine that position (even if it's mediocre, that just means it takes longer).
I found this while looking into how AI companies plan to use paywalled research to train their LLMs and other foundation models. The authors' "knowledge units" remind me of semantic web efforts with triple stores.
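To make the comparison concrete, here's what one of those facts looks like as plain triples; just a sketch with rdflib, where the vocabulary and URIs are made up for illustration:

```python
# One "knowledge unit" expressed semantic-web style, as RDF triples.
# The ex: vocabulary and the paper URI are hypothetical.
from rdflib import Graph, Literal, Namespace, URIRef

EX = Namespace("http://example.org/ku/")
g = Graph()
g.bind("ex", EX)

paper = URIRef("http://example.org/paper/123")  # hypothetical source document
g.add((EX.finding1, EX.subject, Literal("aspirin")))
g.add((EX.finding1, EX.relation, Literal("inhibits")))
g.add((EX.finding1, EX.object, Literal("platelet aggregation")))
g.add((EX.finding1, EX.derivedFrom, paper))

print(g.serialize(format="turtle"))  # reusable facts, no original prose attached
```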