
Reddit wants to get paid for helping to train big AI systems

    Reddit has long been a hot spot for conversation on the web. About 57 million people visit the site every day to talk about topics as diverse as makeup, video games, and driveway washing tips.

    In recent years, Reddit’s chats have also been a free training resource for companies like Google, OpenAI, and Microsoft. Those companies are using Reddit’s conversations to develop giant artificial intelligence systems that many in Silicon Valley believe are on their way to becoming the tech industry’s next big thing.

    Now Reddit wants to get paid for it. The company said on Tuesday that it planned to charge businesses for access to its application programming interface, or API, the method by which outside entities can download and process the social network’s vast collection of person-to-person conversations.

    “Reddit’s database is really valuable,” Steve Huffman, Reddit’s founder and CEO, said in an interview. “But we don’t have to give all that value to some of the biggest companies in the world for free.”

    The move is one of the first major examples of a social network charging for access to the conversations it hosts when they are used to develop AI systems like ChatGPT, OpenAI’s popular program. Those new AI systems could one day lead to big businesses, but they are unlikely to help companies like Reddit much. They could even be used to create competitors: automated duplicates of Reddit’s conversations.

    Reddit is also making the move as it prepares for a possible initial public offering on Wall Street this year. Founded in 2005, the company makes most of its money from advertising and e-commerce transactions on its platform. Reddit said it was still working out the details of what it would charge for API access and would announce pricing in the coming weeks.

    Reddit’s conversation forums have become a valuable commodity as large language models, or LLMs, have become an essential part of creating new AI technology.

    LLMs are essentially advanced algorithms developed by companies like Google and OpenAI, a close partner of Microsoft. To the algorithms, the Reddit conversations are data, and they are part of the vast amount of material fed into the LLMs to develop them.

    The underlying algorithm that helped build Bard, Google’s conversational AI service, was trained in part on Reddit data. OpenAI’s ChatGPT cites Reddit data as one of the sources of information it has been trained on.

    Other companies are also starting to see value in the conversations and images they host. Shutterstock, the image hosting service, also sold image data to OpenAI to help create DALL-E, the AI program that creates vibrant graphics using only a text-based prompt.

    Last month, Elon Musk, the owner of Twitter, said he was cracking down on the use of the Twitter API, which thousands of companies and independent developers use to track the millions of conversations on the network. He did not name LLMs as a reason for the change, but the new fees could run into tens or even hundreds of thousands of dollars.

    To keep improving their models, artificial intelligence makers need two main things: a huge amount of computing power and a huge amount of data. Some of the biggest AI developers have plenty of computing power, but they still look outside their own networks for the data they need to improve their algorithms. That includes sources like Wikipedia, millions of digitized books, academic articles, and Reddit.

    Representatives from Google, OpenAI and Microsoft did not immediately respond to a request for comment.

    Reddit has long had a symbiotic relationship with the search engines of companies like Google and Microsoft. The search engines “crawl” Reddit’s web pages to index information and make it available for search results. That crawling, or “scraping,” isn’t always welcome on every site on the Internet. But Reddit has benefited by appearing higher in search results.

    The dynamic is different with LLMs: they gobble up as much data as possible to create new AI systems like the chatbots.

    Reddit believes that its data is particularly valuable because it is constantly updated. That freshness and relevance, Mr. Huffman said, is what large language model algorithms need to produce the best results.

    “More than any other place on the internet, Reddit is a place for authentic conversations,” said Mr. Huffman. “There are a lot of things on the site that you would only say in therapy, or AA, or never at all.”

    Mr. Huffman said Reddit’s API would still be free for developers who wanted to build applications that help people use Reddit. They could use the tools to, for example, build a bot that automatically tracks whether users’ comments comply with the posting rules. Researchers who want to study Reddit data for academic or non-commercial purposes will continue to have free access.

    Reddit also hopes to incorporate more so-called machine learning into how the site itself works. For example, it could be used to identify AI-generated text on Reddit and add a label letting users know the comment came from a bot.

    The company also promised to improve software tools that could be used by moderators — the users who volunteer their time to keep the site’s forums running smoothly and improve conversations between users. And third-party bots that help moderators monitor the forums will continue to be supported.

    But for the AI makers, it’s time to pay.

    “Crawling Reddit, generating value and not returning that value to our users is something we have a problem with,” said Mr. Huffman. “It’s a good time for us to tighten things up.”

    “We think that’s fair,” he added.