Reading more than 10,000 volumes, court documents show Anthropic destroyed millions of physical books to train AI

Generative AI has long been criticized for its well-known reliability issues, huge energy consumption, and unauthorized use of copyrighted material. Now, a recent court case has revealed that training these AI models also involves the wholesale destruction of physical books.

Hidden in a recent verdict against Anthropic is a surprising detail: The AI-generating company destroyed millions of physical books, cutting the bindings and discarding the remains, in order to train its artificial intelligence assistants. It is worth noting that this destruction was considered a factor in the court’s final decision in Anthropic’s favor.

To build its language model and ChatGPT competitor Claude, Anthropic trained on as many books as possible. The company purchased millions of physical books and digitized them by tearing out and scanning the pages, permanently destroying them in the process.

Additionally, Anthropic has no plans to publicly release the final digital version. This detail helped convince the judge that digitizing and scraping the books constituted a sufficient conversion to qualify as fair use. While Claude may use digitized libraries to generate unique content, critics point out that large language models can sometimes replicate content verbatim based on their training data.

Anthropic’s partial legal victory allows it to train AI models using copyrighted books without notifying the original publisher or author, potentially removing one of the biggest obstacles facing the generative AI industry. A former Metal executive recently admitted that AI would die overnight if required to comply with copyright laws, likely because developers would lose access to the vast amounts of data needed to train large language models.

However, ongoing copyright disputes still pose a significant threat to the technology. Earlier this month, the CEO of Getty Images admitted that the company cannot afford to fight all AI-related copyright infringements. Meanwhile, Disney’s lawsuit against Midjourney — in which the company demonstrated the image generator’s ability to copy copyrighted content — could have significant ramifications for the broader generative AI ecosystem.

That being said, the judge in the Anthropic case did rule against the company because it relied in part on a library of pirated books to train Cloud. Anthropic still faces a copyright trial in December, when the company could be required to pay up to $150,000 for each pirated work.