
Meta secretly trained its AI on a notorious piracy database, newly unredacted court documents reveal

    “Meta has treated the so-called 'public availability' of shadow datasets as a get-out-of-jail-free card, despite the fact that internal Meta records show every relevant decision maker at Meta, up to and including the CEO, Mark Zuckerberg, knew that LibGen 'was a data set that we know to be illegal,'” the plaintiffs allege in this motion. (Originally filed in late 2024, the motion is a request to file a third amended complaint.)

    In addition to the plaintiffs' briefing, another filing was left unredacted in response to Chhabria's order: Meta's opposition to the motion to file an amended complaint. It argues that the authors' attempts to add additional claims to the case are an “eleventh-hour gamble based on a false and inflammatory premise” and denies that Meta waited to reveal crucial information during discovery. Instead, Meta claims it first disclosed to plaintiffs that it used a LibGen dataset in July 2024. (Because much of the discovery material remains confidential, it is difficult for WIRED to confirm that claim.)

    Meta's argument rests on the claim that the plaintiffs already knew about the LibGen use and should not be given additional time to file a third amended complaint when they had ample time to do so before discovery ended in December 2024. “Plaintiffs knew about Meta's downloading and use of LibGen and other so-called 'shadow libraries' since at least mid-July 2024,” the tech giant's lawyers argue.

    In November 2023, Chhabria granted Meta's request to dismiss some of the lawsuit's claims, including the allegation that Meta's alleged use of the authors' work to train AI violated the Digital Millennium Copyright Act, a US law introduced in 1998 to prevent people from selling or duplicating copyrighted works on the Internet. At the time, the judge agreed with Meta's position that the plaintiffs had not provided sufficient evidence to prove that the company had removed so-called “copyright management information,” such as the author's name and the title of the work.

    The unredacted documents argue that the plaintiffs should be allowed to amend their complaint, claiming that the information Meta revealed is proof that the DMCA claim was justified. They also say the discovery process has provided grounds to add new allegations. “Meta, through a company representative who testified on November 20, 2024, has now admitted under oath that it uploaded (referred to as 'seeding') illegal files containing Plaintiffs' works to 'torrent' sites,” the motion alleges. (Seeding is when torrented files are shared with other peers after they have finished downloading.)

    “This torrenting activity turned Meta itself into a distributor of the very same illegal, copyrighted material it was downloading for use in its commercially available AI models,” claims one of the newly unredacted documents, alleging that Meta, in other words, had not only used copyrighted material without permission but also distributed it.

    LibGen, an archive of books uploaded to the Internet that originated in Russia around 2008, is one of the largest and most controversial “shadow libraries” in the world. In 2015, a New York judge issued a preliminary injunction against the site, a measure that was theoretically intended to temporarily shut down the archive, but the anonymous administrators simply changed domains. In September 2024, another New York judge ordered LibGen to pay $30 million to rights holders for infringing their copyrights, despite not knowing who actually operates the piracy hub.

    Meta's discovery problems in this case aren't over yet, either. In the same order, Chhabria warned the tech giant against overly sweeping redaction requests in the future: “If Meta submits an unreasonably broad sealing request again, all materials will simply be unsealed,” he wrote.