Large language models can generate text strings based on word patterns learned from the web pages, books, and other blocks of text in their training data. In addition to ChatGPT, the programs are the guts of search chatbots such as Microsoft Bing Chat and Google’s Bard, and underlie a growing number of applications that produce professional and creative texts in an instant. Their image- and video-generating counterparts draw on patterns from image datasets such as photos collected from Pinterest and Flickr.
Often, datasets used in AI development are built in unofficial ways, such as by deploying software that scrapes content from websites. In the US, that’s generally considered legal, though copyright concerns and website terms of use that prohibit the practice have called it into question.
A few websites, like Reddit and Stack Overflow, have been more inviting. They provide downloadable “data dumps” or real-time data portals, known as APIs, to help software access their content. In the case of Stack Overflow, LLM developers get their hands on data through a mix of dumps, APIs, and scraping, says Chandrasekar, all of which can be done for free these days.
But Chandrasekar says LLM developers are violating Stack Overflow’s terms of service. Users own the content they post on Stack Overflow, as described in the TOS, but it’s all under a Creative Commons license that requires anyone who later uses the data to credit where it came from. When AI companies sell their models to clients, “they are unable to credit every member of the community whose questions and answers were used to train the model, thereby violating the Creative Commons license,” says Chandrasekar.
Neither Stack Overflow nor Reddit has released pricing information. “We’re working on that right now,” said Reddit spokesperson Tim Rathschmidt, “and we’ll be sharing more with partners in the coming weeks.” Stack Overflow will study Reddit’s strategy and consult with its own potential customers, some of whom have already reached out about data access, Chandrasekar says.
One possible roadmap to pricing could come from Elon Musk, who raised prices for access to Twitter data this month. They start at $42,000 per month for access to 50 million tweets. About three times as many tweets were previously available for free. In a tweet this week, Musk accused Microsoft, a major AI developer and close partner of OpenAI, of training algorithms “illegally using Twitter data.” Without elaboration, he added, “Time for trial.”
Both Stack Overflow and Reddit will continue to license data for free to some people and companies. Chandrasekar says Stack Overflow wants compensation only from companies that develop LLMs for big, commercial purposes. “When people start paying for products built on community-built sites like ours, it’s not fair use,” he says.
Reddit CEO Steve Huffman told The New York Times this week that he wouldn’t give a freebie to the world’s largest companies. “Crawling Reddit, generating value and not returning that value to our users is something we have a problem with,” he said.