Data revolts break out against AI

    For over 20 years, Kit Loffstadt has been writing alternate universe fan fiction for “Star Wars” heroes and “Buffy the Vampire Slayer” villains, sharing her stories online for free.

    But in May, Ms. Loffstadt stopped posting her creations after learning that a data company copied her stories and fed them into the artificial intelligence technology that underpins ChatGPT, the viral chatbot. Appalled, she hid her writing behind a locked account.

    Ms. Loffstadt also helped organize an act of rebellion against AI systems last month. Along with dozens of other fanfiction writers, she published a deluge of irreverent stories online to overwhelm and confuse the data-collection services that feed writers’ work into AI technology.

    “We must all do what we can to show them that the output of our creativity is not for machines to harvest as they please,” said Ms. Loffstadt, a 42-year-old voice actor from South Yorkshire in Britain.

    Fanfiction writers are just one group now rising up against AI systems as the technology fever grips Silicon Valley and the world. In recent months, social media companies such as Reddit and Twitter, news organizations such as The New York Times and NBC News, authors such as Paul Tremblay and the actress Sarah Silverman have all spoken out against AI systems scooping up their data without permission.

    Their protests have taken various forms. Writers and artists are locking their files to protect their work or boycotting websites that publish AI-generated content, while companies like Reddit want to charge for access to their data. At least 10 lawsuits have been filed against AI companies this year, accusing them of training their systems on artists’ creative work without permission. This past week, Ms. Silverman and the authors Christopher Golden and Richard Kadrey sued OpenAI, the creator of ChatGPT, and others, accusing them of using their writing to train AI systems.

    At the heart of the revolts is a newfound understanding that online information—stories, illustrations, news articles, message board posts, and photographs—can have significant untapped value.

    The new wave of AI – known as “generative AI” for the text, images and other content it generates – is built on complex systems such as large language models, which are capable of producing human-like prose. These models are trained on all sorts of data so they can answer people’s questions, mimic writing styles, or produce comedy and poetry.

    That has led to a hunt by tech companies for even more data to feed their AI systems. Google, Meta, and OpenAI essentially used information from all over the internet, including large databases of fan fiction, large volumes of news articles, and book collections, much of which was freely available online. In the parlance of the tech industry, this was known as “scraping” the Internet.

    OpenAI’s GPT-3, an AI system released in 2020, was trained on 500 billion “tokens,” each representing a part of a word, mostly found online. Some AI models span over a trillion tokens.
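    The “tokens” in these counts are simply frequent chunks of text. A toy sketch below illustrates the idea; it is not OpenAI’s actual tokenizer, which uses a learned byte-pair-encoding vocabulary built from vast amounts of scraped text, and the tiny vocabulary here is invented for illustration.

```python
# Toy illustration of tokenization: tokens are "parts of words."
# Real tokenizers learn their vocabularies from data; this hypothetical
# splitter just greedily matches the longest known chunk from the left.

VOCAB = {"scrap", "ing", "fan", "fiction", "the", "internet"}

def toy_tokenize(text):
    tokens = []
    for word in text.lower().split():
        while word:
            for end in range(len(word), 0, -1):
                chunk = word[:end]
                # Fall back to a single character if no chunk matches.
                if chunk in VOCAB or end == 1:
                    tokens.append(chunk)
                    word = word[end:]
                    break
    return tokens

print(toy_tokenize("Scraping the internet"))
# A 500-billion-token corpus is 500 billion such pieces.
```

    Counting a model’s training data in tokens rather than words is what makes figures like “500 billion” comparable across systems with different vocabularies.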

    The practice of scraping the internet has been around for a long time and was largely disclosed by the companies and nonprofits that did it. But it was not well understood or seen as particularly problematic by the companies that owned the data. That changed after ChatGPT debuted in November and the public learned more about the underlying AI models that powered the chatbots.

    “What’s happening here is a fundamental realignment of the value of data,” said Brandon Duderstadt, the founder and CEO of Nomic, an AI company. “In the past, the idea was that you extracted value from data by making it accessible to everyone and by placing advertisements. Now the thinking is that you lock up your data, because you can get a lot more value out of it if you use it as input for your AI.”

    The data protests may have little effect in the long run. Big-pocketed tech giants like Google and Microsoft already have mountains of proprietary information and the resources to license more. But as the era of easy-to-scrape content draws to a close, smaller AI startups and nonprofits that had hoped to compete with the big companies may not be able to get enough content to train their systems.

    In a statement, OpenAI said ChatGPT is trained on “licensed content, publicly available content, and content created by human AI trainers.” It added: “We respect the rights of creators and authors and look forward to continuing to work with them to protect their interests.”

    Google said in a statement that it was involved in conversations about how publishers might manage their content going forward. “We believe everyone benefits from a vibrant content ecosystem,” the company said. Microsoft did not respond to a request for comment.

    The data revolts erupted last year after ChatGPT became a global phenomenon. In November, a group of programmers filed a proposed class action lawsuit against Microsoft and OpenAI, alleging that the companies violated their copyright after their code was used to train an AI-powered programming assistant.

    In January, Getty Images, which provides stock photos and videos, sued Stability AI, an AI company that creates images from text descriptions, alleging that the startup had used copyrighted photos to train its systems.

    Then, in June, Clarkson, a Los Angeles law firm, filed a 151-page proposed class-action lawsuit against OpenAI and Microsoft, describing how OpenAI had collected data from minors and arguing that web scraping violated copyright law and amounted to “theft.” On Tuesday, the firm filed a similar lawsuit against Google.

    “The data rebellion that we’re seeing across the country is society’s way of pushing back against this idea that Big Tech is simply entitled to take any and all information from any source whatsoever, and make it their own,” said Ryan Clarkson, the founder of Clarkson.

    Eric Goldman, a professor at the Santa Clara University School of Law, said the lawsuit’s arguments were expansive and unlikely to be accepted by the court. But the wave of lawsuits is just beginning, he said, with a “second and third wave” coming that would shape the future of AI.

    Larger companies are also pushing back against AI scrapers. In April, Reddit said it would begin charging for access to its application programming interface, or API, the method by which third parties can download and analyze the social network’s vast database of person-to-person conversations.
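    In concrete terms, an API lets a site gate and meter requests with credentials, unlike open scraping of public pages. A minimal sketch of what metered access looks like to a third party, using only the Python standard library; the endpoint and token here are hypothetical placeholders, not Reddit’s actual API:

```python
import urllib.request

# Hypothetical endpoint and token -- not Reddit's real API, just a sketch
# of how authenticated, billable access differs from open scraping.
API_URL = "https://api.example.com/comments?limit=100"
API_KEY = "my-paid-access-token"

def build_request():
    # The bearer token identifies (and bills) the client making the call,
    # which is what lets the platform charge for its data.
    return urllib.request.Request(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
    )

def fetch_comments():
    # Performs the actual network call; requires a live endpoint.
    with urllib.request.urlopen(build_request()) as response:
        return response.read()
```

    Requests without a valid token can simply be rejected, which is how a platform turns data that was once freely scrapable into a paid product.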

    Reddit CEO Steve Huffman said at the time that his company “didn’t have to give all that value to some of the biggest companies in the world for free.”

    That same month, Stack Overflow, a question-and-answer site for computer programmers, said it would also ask AI companies to pay for data. The site has nearly 60 million questions and answers. The move was previously reported by Wired.

    News organizations are also resisting AI systems. In an internal memo on the use of generative AI in June, The Times said that AI companies should “respect our intellectual property.” A Times spokesperson declined to elaborate.

    For individual artists and writers, fighting back against AI systems has meant rethinking where they publish.

    Nicholas Kole, 35, an illustrator in Vancouver, British Columbia, was alarmed by how his distinct art style could be replicated by an AI system and suspected the technology had scraped his work. He plans to continue posting his creations on Instagram, Twitter, and other social media sites to attract customers, but he has stopped publishing on sites like ArtStation that post AI-generated content alongside human-generated content.

    “It just feels like wanton theft from me and other artists,” said Mr. Kole. “It puts a pit of existential dread in my stomach.”

    At Archive of Our Own, a fanfiction database of more than 11 million stories, writers have increasingly pushed the site to ban data scraping and AI-generated stories.

    In May, when some Twitter accounts shared examples of ChatGPT mimicking the style of popular fanfiction posted on Archive of Our Own, dozens of writers mobilized. They locked their stories and wrote subversive content to mislead the AI scrapers. They also urged Archive of Our Own’s leaders to stop allowing AI-generated content.

    Betsy Rosenblatt, who provides legal advice to Archive of Our Own and is a professor at the University of Tulsa College of Law, said the site had a policy of “maximum inclusiveness” and did not want to be in the position of discerning which stories were written with AI.

    For Ms. Loffstadt, the fanfiction writer, the fight against AI came when she was writing a story about “Horizon Zero Dawn,” a video game where humans fight against AI-powered robots in a post-apocalyptic world. In the game, she said, some robots were good and some were bad.

    But in the real world, she said, “thanks to corporate hubris and greed, they are twisted into doing bad things.”