Dark copyright evasion magic makes light work of developers' guardrails

Machine learning models, particularly commercial ones, generally do not list the data developers used to train them. Yet what models contain and whether that material can be elicited with a particular prompt remain matters of financial and legal consequence, not to mention ethics and privacy.

Anthropic, Google, OpenAI, and Nvidia, among others, face over 60 legal claims arising from the alleged use of copyrighted content to train their models without authorization. These companies have invested hundreds of billions of dollars based on the belief that their use of other people's content is lawful.

As courts grapple with the extent to which makers of AI models can claim fair use as a defense, one issue under consideration is whether these models have memorized training data by encoding source material in their weights (the parameters learned during training that determine a model's output), and whether they will emit that material on demand.
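To illustrate what "emit that material on demand" means in practice (this is an illustration, not a test described in the article), researchers commonly probe for memorization with a prefix-continuation test: prompt the model with the opening of a known passage and check whether it completes it verbatim. A minimal sketch, assuming a hypothetical `complete(prompt)` callable that wraps whatever model API is under test:

```python
def memorization_probe(complete, passage: str, prefix_len: int = 200,
                       min_match: int = 50) -> bool:
    """Prefix-continuation test for memorization.

    `complete` is a hypothetical callable wrapping the model under test;
    it takes a prompt string and returns the generated continuation.
    """
    prefix, truth = passage[:prefix_len], passage[prefix_len:]
    output = complete(prefix)

    # Count how many leading characters of the generation match the
    # true continuation exactly.
    matched = 0
    for got, want in zip(output, truth):
        if got != want:
            break
        matched += 1

    # A long verbatim run suggests the passage was encoded in the
    # weights rather than paraphrased or independently generated.
    return matched >= min_match
```

A run of fifty or more matching characters is far too long to arise by chance for ordinary prose, which is why thresholds of this rough size show up in extraction studies.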

Various factors bear on whether fair use applies under US law, but if a model faithfully reproduces most or all of a particular work when asked, that may weaken a fair use defense. One factor is whether the use is "transformative": whether the model adds something new or changes the character of the work. That becomes harder to claim if a model regurgitates protected content verbatim.
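One crude way to make "reproduces most or all of a particular work" measurable (a sketch of mine, not a legal test) is the longest common substring shared between a model's output and the original work, taken as a fraction of the work's length:

```python
from difflib import SequenceMatcher

def verbatim_overlap(model_output: str, original_work: str) -> float:
    """Longest common substring between the model's output and the
    original work, as a fraction of the work's length.

    A value near 1.0 means the output contains most of the work
    verbatim, the kind of reproduction that undercuts a
    "transformative" argument; a value near 0.0 means at most short,
    incidental matches.
    """
    match = SequenceMatcher(None, model_output, original_work,
                            autojunk=False).find_longest_match(
        0, len(model_output), 0, len(original_work))
    return match.size / max(1, len(original_work))
```

Real litigation would weigh far more than a single overlap score, but a metric like this captures the distinction the fair use analysis turns on: verbatim copying versus genuinely transformed output.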

...read more at theregister.com