Tuesday, September 26, 2023

"The larger issue for AI in terms of copyright law is likely to be the concept of fair use, particularly as it applies to the training data used to create the large language models (LLMs) underpinning generative AI."

The headline is a single paragraph from Computerworld -

 https://www.computerworld.com/article/3707348/generative-ai-and-us-copyright-law-are-on-a-collision-course.html

Readers have likely seen the question in one form or another: AI bumping into copyright, primarily through developers crawling the web for content to feed to an LLM (large language model), such as the one behind OpenAI's ChatGPT chatbot, in order to train it to probabilistically generate content for users. A second dimension, whether AI-created "art" or content can be copyrighted, is a focus of ArsTech, here.

The Computerworld item is a generic example; readers can surely search and find a host of items on point, e.g., Ars again, here. (websearch returned items)

Crabgrass will not try an exhaustive review - instead this brief post is flagging the concerns for those who may be unaware of issues, or unfamiliar with what is at stake.

The initially cited CW item, from the headline paragraph, continues -

Fair use, in brief, is a defense to copyright claims written into federal law. The four factors that courts have to consider when deciding whether a particular use of copyright material without permission is “fair use,” are the character and purpose of the use (educational or other not-for-profit use is much more likely to be deemed fair than commercial use), the nature of the original work, the amount of the original work used, and the market effect on the original work.

Copyright a stumbling block for AI model training

Given those factors, it’s perhaps unsurprising that the lawsuits against companies like OpenAI have already begun. Most notably, a group of authors that includes comedian and writer Sarah Silverman sued OpenAI and Meta in July over the company’s use of their books to train ChatGPT.

The core issue in that lawsuit is the use of a data set called “BookCorpus,” which, the plaintiffs say, contained their copyright material. OpenAI and Meta are likely to argue that the market effect on Silverman’s and others works is negligible, and that the “character and purpose” of the use is different than that which prompted the writing of the books in the first place, while the plaintiffs are likely to highlight the for-profit nature of Meta and OpenAI’s use, as well as the use of entire works in training data.

Precedent, however, may be on the AI companies’ side — the Google Books case, which was a fair use action brought by the Author’s Guild of America against Google’s mass book digitization project in 2005. The case’s history is complicated, including appeals went on for a decade, and was ultimately settled in Google’s favor.

Whether that’s likely to be predictive, however, is debatable, according to Loengard, and much could depend on a judge’s willingness to challenge a large, profitable industry.

“By the time it ended, Google Books had become a tool of many researchers,” she said. “So there’s this idea that the cat’s out of the bag — and of course, the court wouldn’t say this out loud, and I’m not saying this is what they did, but they could look at it and say that once something has entered mainstream commerce that it’s harder to reel it back in and regulate it.”

Obviously derivative works could be another copyright battleground for the AI industry, given that the technology has already been used to produce convincing imitations of popular singers and songwriters.  The right of publicity — a different legal concept covering the rights to a person’s name, image and likeness, could become a cause of action for the performance itself — i.e., the sound of Taylor Swift’s voice. But copyright could still become an issue if the underlying song is sufficiently similar to one written by Swift.

That flags the questions, so that as the judiciary fleshes out a law of do-and-don't, the matters reported will not be unfamiliar to readers. Enjoy, as things sift through the filters of the law. It will affect the World's Highest Standard of Living. And the EU will have its own say. And hackers will attempt to figure out how to throw a sabot into the machine, in a profitable way or for the hell of it, much as ransomware has recently flourished as a way to make money, in parallel to cryptocurrency flourishing as a way to be defrauded. Not that all in that arena are fraudsters, but once legal tender gets put into the system, it is sometimes hard to get it back, unregulated crypto exchanges and all. A video. From before the FTX bottom dropped out.

UPDATE: Fads come. Fads go. Currently crypto seems on the wane. LLM products are the rage. Aside from playing around with ChatGPT and its use in Microsoft's Bing search product, I have discovered no compelling reason to focus on them. To the extent they give behind-the-scenes assistance that improves search engine performance, the LLMs are great: neither in the way nor the primary driver of the list a particular search engine returns for a particular choice of wording in a search.

Aside from each being a fad, crypto and LLMs share little, one being a place to lose time and money, the other being a place to lose time and to face yet more of an intrusive Gestalt affecting personal data privacy. (Personal data privacy being something affecting Hunter Biden now more than affecting me.)

___________UPDATE__________

A while back, "data mining" was a buzz term. An enterprise has much data, and the buzz was how to maximize the use of it. Now the buzz will be AI models as "data miners."

It may not be phrased that way, but it will be the direction of effort. 

The CW item linked to this CIO post, giving a flavor of how opportunists are pouncing onto the new thing: how to use it, and how a specialist boutique can monetize its usage knowledge by selling expertise to large enterprise IT departments.

In a sense, same as it ever was. The man in the blue suit coming from IBM to straighten out things will evolve more and more toward independent specialist service providers, as has happened in advertising and PR, and in political consulting, where some poll, some do half-minute positive or negative specialized video products, some do booking, some scout influential sub-market talent and allied entrepreneurs of the same party, and some do opposition research.

Remember MySpace? Gone, while Facebook is still there. Sorting out will be the next phase. Winnowing. Remember Netscape?

Hungry people with ideas, competing with one another, and the big firms and deep pockets will keep control, keep calling the shots, with teams of consultants available, and teams of AI products to select from, to tune to specific enterprise needs.

FURTHER: The last paragraph, in a concrete embodiment; 155 pages' worth:

Sparks of Artificial General Intelligence: Early experiments with GPT-4