The race to block OpenAI's scraping bots is slowing

It is still too early to say how the wave of licensing deals between AI companies and publishers will shake out. OpenAI has already scored one clear victory, though: its web crawlers are no longer being blocked by top news outlets to the degree they once were.

The generative AI boom set off a data gold rush, and a subsequent data-protection rush (at least for most news websites) as publishers sought to block AI crawlers and keep their work from becoming training data without permission. When Apple debuted a new AI agent this summer, for example, a slew of top news outlets swiftly opted out of Apple's web scraping using the Robots Exclusion Protocol, or robots.txt, the file that lets webmasters control how bots access their sites. There are so many new AI bots on the market that keeping up can feel like playing whack-a-mole.
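For context, here is a minimal sketch of the kind of robots.txt entry a publisher can use to opt out of OpenAI's crawler. The file sits at the root of a site (e.g. /robots.txt); GPTBot is the user-agent token OpenAI documents, and the comment is illustrative:

```
# Refuse OpenAI's training crawler access to the entire site
User-agent: GPTBot
Disallow: /
```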

OpenAI's GPTBot has the most name recognition and is also blocked more often than competitors such as Google's AI crawler. The number of high-ranking media websites using robots.txt to "ban" GPTBot rose dramatically from its launch in August 2023 through that fall, then climbed steadily (but more gradually) from November 2023 to April 2024, according to an analysis of 1,000 popular news websites by Ontario-based AI detection startup Originality AI. At its peak, just over a third of the websites were blocking GPTBot; that figure has since fallen closer to a quarter. Within a smaller group of the most prominent news outlets, the block rate is still above 50 percent, but it is down from a high of nearly 90 percent earlier this year.

Then, last May, after Dotdash Meredith announced a licensing deal with OpenAI, that number dropped significantly. It fell again in late May, when Vox announced its own agreement, and again in August, when WIRED's parent company, Condé Nast, struck a deal. The trend toward increased blocking appears to be over, at least for now.

Each of these dips makes sense. When companies enter partnerships and consent to having their data used, they no longer have an incentive to barricade it, so it follows that they would update their robots.txt files to permit crawling; make enough deals, and the overall percentage of sites blocking crawlers is bound to fall. Some outlets, such as The Atlantic, unblocked OpenAI's crawlers the same day they announced a deal. Others took anywhere from a few days to a few weeks, like Vox, which announced its partnership at the end of May and unblocked GPTBot on its properties by the end of June.
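As a sketch of what such an update can look like in practice (the directives are standard robots.txt syntax; the comments and the specific stanzas are illustrative assumptions, not any particular outlet's file):

```
# Before a licensing deal: refuse GPTBot everywhere
User-agent: GPTBot
Disallow: /

# After a deal, a site might delete that stanza entirely,
# or replace it with an empty Disallow, which permits everything:
User-agent: GPTBot
Disallow:
```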

Robots.txt is not legally binding, but it has long functioned as the standard governing web crawler behavior. For most of the internet's existence, people running websites expected one another to abide by the file. When a WIRED investigation earlier this summer found that the AI startup Perplexity was likely choosing to ignore robots.txt commands, Amazon's cloud division launched an investigation into whether Perplexity had violated its rules. It's not a good look to ignore robots.txt, which likely explains why so many of the most prominent AI companies, including OpenAI, explicitly state that they use it to determine what to crawl. Originality AI CEO Jon Gillham believes this adds extra urgency to OpenAI's push to make agreements. "It's clear that OpenAI views being blocked as a threat to their future ambitions," says Gillham.
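For the curious, a compliant crawler's robots.txt check can be sketched in a few lines of Python using the standard library's urllib.robotparser; the domain and path below are placeholders, not any real outlet's:

```python
from urllib.robotparser import RobotFileParser

# Fetch and parse the site's robots.txt (example.com is a placeholder).
parser = RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()

# A well-behaved crawler asks before fetching each URL.
# If the file contains "User-agent: GPTBot" / "Disallow: /",
# this returns False and the page should be skipped.
if parser.can_fetch("GPTBot", "https://example.com/news/some-article"):
    print("Allowed: crawl the page")
else:
    print("Blocked: skip the page")
```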