After Stanford Internet Observatory researcher David Thiel found links to child sexual abuse material (CSAM) in an AI training dataset that tainted the image generators trained on it, the controversial dataset was immediately taken offline in 2023.
The Large-scale Artificial Intelligence Open Network (LAION) team has now released a cleaned version of the LAION-5B dataset, called Re-LAION-5B. According to LAION, it is “the first web-scale, text-link-to-images dataset that has been thoroughly cleaned of known links to suspected child sexual abuse material.”
To clean the dataset, LAION worked with the Internet Watch Foundation (IWF) and the Canadian Centre for Child Protection (C3P) to remove 2,236 links that matched hashed images in the online safety organizations’ databases. The removals included all of the links flagged by Thiel, as well as content flagged by LAION’s partners and other watchdogs, such as Human Rights Watch, which warned of privacy risks after photos of real children were included in the dataset without their consent.
In his research, Thiel warned that “incorporating child abuse material into AI model training data teaches tools to associate children with illegal sexual activity and uses well-known images of child abuse to generate new, potentially realistic child abuse content.”
Thiel urged LAION and other researchers scraping the internet for AI training data to adopt a new safety standard to better filter out not just CSAM, but also all explicit images that could be combined with photos of children to generate CSAM. (The U.S. Department of Justice recently reiterated emphatically that “CSAM generated by AI is still CSAM.”)
While the new dataset does nothing to fix models already trained on the previous one, LAION claimed that Re-LAION-5B establishes “a new safety standard for cleaning web-scale image-link datasets.” Where illegal content previously “slipped through LAION's filters,” the researchers have now developed an improved system “for identifying and removing illegal content,” according to LAION's blog post.
Thiel told Ars that he would agree LAION has set a new safety standard with its latest release, but “there are absolutely ways to improve it.” However, “those methods would require either possession of all the original images or an entirely new crawl,” and LAION's post made clear that it only used image hashes and did not perform a new crawl that could have risked pulling in more illegal or sensitive content. (On Threads, Thiel shared more in-depth impressions of LAION's effort to clean up the dataset.)
LAION warned that “current advanced filters alone are not reliable enough to guarantee protection against CSAM in scenarios where data is composed at web scale.”
“To ensure better filtering, lists of hashes of suspicious links or images created by expert organizations (in our case IWF and C3P) are suitable choices,” the LAION blog states. “We encourage research labs and other organizations that compile datasets from the public web to work with organizations like IWF and C3P to obtain such hash lists and use them for filtering. In the longer term, a larger community initiative can be created that makes such hash lists available to the research community working on compiling datasets from the web.”
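In practice, this kind of hash-list filtering amounts to checking each dataset entry against a block list supplied by an expert organization. The sketch below is a minimal illustration of that idea, not LAION's actual pipeline: it assumes a hypothetical plain-text file of hex digests and a CSV of link-caption pairs, and it matches on SHA-256 digests of URLs, whereas real deployments typically match perceptual or cryptographic hashes of the image content itself.

```python
# Minimal sketch of hash-list filtering for a web-scale image-link dataset.
# Assumptions (illustrative only): "blocked_hashes.txt" holds one hex digest
# per line from a partner organization, and "laion_links.csv" has a "url"
# column. Real hash lists (e.g., from IWF or C3P) use image-content hashes,
# not URL digests.
import csv
import hashlib


def load_hash_list(path: str) -> set[str]:
    """Load one hex digest per line into a set for O(1) lookups."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}


def url_digest(url: str) -> str:
    """Hash the URL string; a production pipeline would hash image bytes."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def filter_dataset(in_path: str, out_path: str, blocked: set[str]) -> int:
    """Copy rows whose digest is not on the block list; return removal count."""
    removed = 0
    with open(in_path, newline="", encoding="utf-8") as src, \
         open(out_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if url_digest(row["url"]) in blocked:
                removed += 1
                continue
            writer.writerow(row)
    return removed


if __name__ == "__main__":
    blocked = load_hash_list("blocked_hashes.txt")  # hypothetical file names
    n = filter_dataset("laion_links.csv", "re_laion_links.csv", blocked)
    print(f"Removed {n} flagged links")
```

The key design point, as LAION's post emphasizes, is that the dataset compiler never needs to possess or inspect the flagged material itself; only the hashes provided by the expert organizations are compared.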
According to LAION, the biggest concern is that some links to known CSAM files included in the 2022 dataset are still active more than a year later.
“It is a clear signal that law enforcement agencies should intensify their efforts to remove domains hosting such images from the public web, based on information and recommendations from organizations such as IWF and C3P, making it a safer place, also for various types of research-related activities,” LAION's blog reads.
Human Rights Watch researcher Hye Jung Han praised LAION for removing the sensitive data her organization had flagged, but also urged further interventions.
“LAION's responsive removal of some children's personal photos from their dataset is very welcome and will help protect these children from having their likenesses misused by AI systems,” Han told Ars. “It is now up to governments to pass child data protection laws that protect the privacy of all children online.”
While LAION's blog stated that the removed content represented an “upper bound” of the child sexual abuse material in the original dataset, AI specialist and Creative.AI co-founder Alex Champandard told Ars he is skeptical that all of it has been removed.
“They're only filtering out previously identified CSAM, which is only a partial solution,” Champandard told Ars. “Statistically, most CSAM cases have probably never been reported or investigated by C3P or IWF. A more reasonable estimate of the problem is around 25,000 instances of things you would never want to train generative models on — maybe even 50,000.”
Champandard agreed with Han that more regulation is needed to protect people from the dangers of AI when training data is taken from the web.
“There is room for improvement on all fronts: privacy, copyright, illegal content, etc.,” Champandard said. Because “too many data rights are violated with such web-scraped datasets,” Champandard suggested that datasets like LAION's “will not stand the test of time.”
“LAION is simply operating in the regulatory gap and lagging behind the legal system until policymakers realize the magnitude of the problem,” Champandard said.