AI and Fair Use: Training Is Okay. Pirating Is Not

Ever wonder how training AI with copyrighted books shakes out legally? That was the question faced by a federal court in the Northern District of California. Three authors sued Anthropic after it used millions of books to train its Claude AI models. Anthropic bought some of the books and pirated the rest. Some of the books were written by the plaintiff authors.

The distinction between buying the books and pirating them made all the difference. Using purchased books to train AI is transformative and fair use, so it doesn’t violate copyright law. Downloading copyrighted books from pirate sites is not fair use and does violate copyright law. See Bartz v. Anthropic PBC, No. C 24-05417 WHA (N.D. Cal. June 23, 2025).

Let’s take a closer look at the case.

The Facts

Anthropic, the AI startup behind Claude, built a massive digital library of books. It legally bought some of them and scanned them, removing the covers and discarding the paper copies after scanning. But it downloaded over seven million books from pirate sites like LibGen and Books3. This central library became the raw material for training Claude’s large language models.

The plaintiff authors claimed that Anthropic copied and used their books without permission, thus violating their copyrights. Anthropic moved for summary judgment, arguing its actions were protected fair use.  

The fair use doctrine in copyright law allows limited use of copyrighted material without the copyright holder’s permission for purposes such as criticism, comment, news reporting, teaching, scholarship, or research. Courts decide fair use by evaluating four factors: the purpose and character of the use, the nature of the copyrighted work, the amount used, and the effect on the market for the original.

In this case, the court’s decision breaks Anthropic’s use into three parts:
1. Training the large language models on the books
2. Scanning purchased print books into digital form
3. Keeping pirated books in a permanent library

Each part got its own fair use ruling.

1. Training the AI Was Fair Use

The court held that using books to train Claude was “spectacularly” transformative. It likened the process to a person reading and internalizing a book, then writing new material. Anthropic wasn’t outputting the original texts to users. It was just using them to help Claude “write like a human.” That was a fair use. Using books for learning is what people do with books.

Importantly, the court noted that the authors never alleged that Claude outputs their works verbatim. So, there was no copying in the output, and therefore, no infringement.

2. Scanning Purchased Books Was Also Fair Use

Anthropic also bought millions of physical books and scanned them into PDFs for easier storage and searchability. The court decided that simply changing the format from paper to digital didn’t infringe the authors’ copyrights.

Why not? Because Anthropic had already bought the books, destroyed the originals, and didn’t distribute the digital versions. The format change was just a more efficient way to store and retrieve the same content it lawfully owned. That was fair use.

3. Keeping Pirated Books Was Not Fair Use

Before it started buying books, Anthropic downloaded millions of pirated titles to build its library, and some of those books were never used to train AI at all. The court thus rejected Anthropic’s argument that because the end goal of training AI was transformative, the initial piracy didn’t matter.

“Not every person who merely intends to make a fair use of a work is thereby entitled to a full copy in the meantime, nor even to steal a copy so that achieving this fair use is especially simple or cost-effective,” said the court.

In other words: building a central digital library from pirated books, especially ones you could have bought, is simply not fair use.

In Short

The key holding here is that training AI with books that were properly acquired does not violate copyright law, even if you digitize them to be in a more useful format. Things could be different if the AI’s output included copies of parts of the book. But that wasn’t the case here.

The other holding is less important and more expected: you can’t simply pirate copyrighted works even if you may later use them to train AI. Anthropic likely suspected as much, because as time went on it began buying books before using them for training.

Because this is the first decision in the case, which was brought as a class action, the litigation is likely to continue for some time, and the ruling is subject to appeal. Still, it offers useful guidance on applying the fair use doctrine to training AI with copyrighted works.

Next

Walters v. OpenAI: When AI Hallucinations Meet Defamation Law