Harvard releases massive free AI training dataset funded by OpenAI and Microsoft

    In addition to the wealth of books, the Institutional Data Initiative is also working with the Boston Public Library to scan millions of articles from various newspapers that are now in the public domain, and it says it is open to similar collaborations in the future. Exactly how the books dataset will be released has not yet been settled. The Institutional Data Initiative has asked Google to collaborate on public distribution, and the company has pledged its support.

    However the IDI dataset is released, it will join a host of similar projects, startups, and initiatives that promise to give companies access to substantial, high-quality AI training materials without the risk of running into copyright issues. Companies like Calliope Networks and ProRata have emerged to issue licenses and design compensation schemes meant to get creators and rights holders paid for providing AI training data.

    There are also other new public domain projects. Last spring, French AI startup Pleias rolled out its own public domain dataset, Common Corpus, which contains an estimated 3 to 4 million books and periodical collections, according to project coordinator Pierre-Carl Langlais. Supported by the French Ministry of Culture, Common Corpus has been downloaded more than 60,000 times this month alone on the open source AI platform Hugging Face. Last week, Pleias announced it is releasing its first set of large language models trained on this dataset, which Langlais told WIRED are the first models “ever trained exclusively on open data and meet the [EU] AI law.”

    Efforts are underway to create similar image datasets as well. AI startup Spawning released its own version this summer, called Source.Plus, which includes public domain images from Wikimedia Commons, as well as a variety of museums and archives. Several major cultural institutions, such as the Metropolitan Museum of Art, have long made their own archives accessible to the public as stand-alone projects.

    Ed Newton-Rex, a former director at Stability AI who now heads a nonprofit organization that certifies ethically trained AI tools, says the emergence of these datasets shows that it is not necessary to steal copyrighted material to build high-performing, high-quality AI models. OpenAI previously told lawmakers in the United Kingdom that it would be “impossible” to create products like ChatGPT without using copyrighted works. “Large public domain datasets like this destroy the 'necessity defense' that some AI companies use to justify scraping copyrighted work to train their models,” Newton-Rex says.

    But he still has reservations about whether the IDI and similar projects will actually change the training status quo. “These datasets will only have a positive impact if they are used, likely in combination with licensing of other data, to replace scraped copyrighted work. If they are just added to the mix, part of a dataset that also includes the unlicensed life's work of the world's creators, AI companies will benefit overwhelmingly,” he says.