Meta was recently hit with a copyright infringement lawsuit over its use of thousands of pirated books to train artificial intelligence models. Meta reportedly used the "Books3" data set, which contains a large number of pirated books, to train its Llama 1 and Llama 2 models. Although Meta admitted that it used Books3, it has refused to pay the authors compensation.
Books3 is a text data set containing 195,000 books, with a total size of nearly 37 GB. AI researcher Shawn Presser created it in 2020 to provide a better data source for improving machine learning algorithms.
Meta also used it to train its own Llama models. However, Books3 contains a large number of copyrighted works scraped from the pirate site Bibliotik, which exposes Meta to legal risk.
Several technology companies have faced similar complaints this year, which accuse them of infringing the copyrights of artists, authors and other content creators when building generative AI models.
In addition, new provisional EU rules on artificial intelligence may force companies to disclose the data sets used to train their models, which could expose them to greater legal risk.