Training AI Models and Copyright

From LawSnap

Each of the major foundational AI models was trained on vast quantities of data. Much of this training data came from copyrighted works, including texts, images and songs.

Did this training infringe the copyrights in these works?

The answer likely turns on whether this use of the copyrighted works was "fair use."

Understanding Fair Use

Generally speaking, the owner of a copyright has the right to restrict others from copying the work.

But U.S. law provides a general exception for "purposes such as criticism, comment, news reporting, teaching (including multiple copies for classroom use), scholarship, or research." 17 U.S. Code § 107. This is known as the "fair use" exception.

Classic examples of fair use include:

  • writing an academic paper that criticizes prior research and quotes from the research being criticized
  • broadcasting a news story about a controversial painting and showing an image of the painting
  • creating a parody of an existing song or movie

There's no "bright line" test for whether a particular use of a copyrighted work counts as fair use. Under the law, courts weigh several factors, including:

(1) the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;

(2) the nature of the copyrighted work;

(3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and

(4) the effect of the use upon the potential market for or value of the copyrighted work.

17 U.S. Code § 107

Training Large Language Models and Fair Use

The paper Foundation Models and Fair Use argues that "If the model produces output that is similar to copyrighted data, particularly in scenarios that affect the market of that data, fair use may no longer apply to the output of the model."

As a thought experiment, consider two different AI models. The first is a "recommendation engine." It has been trained on millions of books, and if you give it a list of books you like, it recommends the book most similar to the ones you listed. This would almost certainly be considered "fair use," in large part because the output it produces (the title of a book) is short and bears little resemblance to the copyrighted data.

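But now consider a "book engine" that actually writes a book for you on the spot. In that case, the argument for "fair use" is much harder to make: the output is long, may resemble the copyrighted books the model was trained on, and could compete with those books in the market.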

More Resources

From the Copyright Alliance: "Does the Use of Copyrighted Works to Train AI Qualify as a Fair Use?"