On Friday, the law firm of Joseph Saveri filed US federal class action lawsuits against OpenAI and Meta on behalf of Sarah Silverman and other authors, accusing the companies of illegally using copyrighted material to train AI language models such as ChatGPT and LLaMA.
Other authors represented include Christopher Golden and Richard Kadrey, and a previous class action lawsuit filed June 28 by the same firm included authors Paul Tremblay and Mona Awad. Each lawsuit alleges violations of the Digital Millennium Copyright Act and unfair competition laws, as well as negligence.
The law firm of Joseph Saveri is no stranger to press-friendly legal action against generative AI. In November 2022, the same firm sued over GitHub Copilot for alleged copyright violations. In January 2023, the same legal group repeated that formula with a class action lawsuit against Stability AI, Midjourney and DeviantArt over AI image generators. According to attorney Matthew Butterick, the GitHub lawsuit is currently on its way to trial. Procedural maneuvering continues in the Stable Diffusion lawsuit, with no clear outcome yet.
In a press release last month, the law firm described ChatGPT and LLaMA as “industrial strength plagiarism that violate the rights of book authors.” Authors and publishers have been contacting the law firm since March 2023, lawyers Joseph Saveri and Butterick wrote, because authors are “concerned” about the “uncanny ability of these AI tools to generate text similar to that found in copyrighted text material, including thousands of books.”
The most recent lawsuits from Silverman, Golden and Kadrey were filed in a US district court in San Francisco. In both cases, the authors have requested a jury trial and are seeking a permanent injunction that could force Meta and OpenAI to make changes to their AI tools.
Meta declined Ars’ request for comment. OpenAI did not immediately respond to Ars’ request for comment.
A spokesperson for the Saveri law firm sent Ars a statement saying: “If this alleged behavior is allowed to continue, these models will eventually replace the authors whose stolen works power these AI products they compete with. This new lawsuit represents a greater fight for the preservation of property rights for all artists and other creators.”
Accused of using “blatantly illegal” datasets
Neither Meta nor OpenAI has fully disclosed what is in the datasets used to train LLaMA and ChatGPT. But lawyers for the authors suing say they deduced the likely data sources from clues in statements and documents released by the companies or related researchers. The authors have accused both OpenAI and Meta of using training datasets containing copyrighted material that was distributed without the permission of authors or publishers, including by downloading works from some of the largest e-book pirate sites.
In the OpenAI lawsuit, the authors alleged that, based on OpenAI disclosures, ChatGPT appeared to have been trained on 294,000 books allegedly downloaded from “infamous ‘shadow library’ websites such as Library Genesis (aka LibGen), Z-Library (also known as Bok), Sci-Hub and Bibliotik.” Meta has revealed that LLaMA was trained on part of a dataset called ThePile, which the other lawsuit alleges includes “everything from Bibliotik” and amounts to 196,640 books.
In addition to allegedly accessing copyrighted works through shadow libraries, OpenAI has also been accused of using a “controversial dataset” called BookCorpus.
BookCorpus, according to the OpenAI lawsuit, “was put together in 2015 by a team of AI researchers for the purpose of training language models.” This research team reportedly “copied the books from a website called Smashwords that hosts self-published novels, which are available to readers for free.” However, these novels are still under copyright and were allegedly “copied into the BookCorpus dataset without permission, credit, or compensation to the authors.”
Ars could not immediately reach the BookCorpus researchers or Smashwords for comment. [Update: Dan Wood, COO of Draft2Digital, which acquired Smashwords in March 2022, told Ars that the Smashwords “store site lists close to 800,000 titles for sale,” with “about 100,000” currently priced at free.
“Typically, the free book will be the first of a series,” Wood said. “Some authors will keep these titles free indefinitely, and some will run limited promotions where they offer the book for free. From what we understand of the BookCorpus data set, approximately 7,185 unique titles that were priced free at the time were scraped without the knowledge or permission of Smashwords or its authors.” It wasn’t until March 2023 that Draft2Digital “first became aware of the scraped books being used for commercial purposes and redistributed, which is a clear violation of Smashwords’ terms of service,” Wood said.
“Every author, whether they have an internationally recognizable name or have just published their first book, deserve to have their copyright protected,” Wood told Ars. “They also should have the confidence that the publishing service they entrust their work with will protect it. To that end, we are working diligently with our lawyers to fully understand the issues—including who took the data and where it was distributed—and to devise a strategy to ensure our authors’ rights are enforced. We are watching the current cases being brought against OpenAI and Meta very closely.”]