The first wave of large generative AI tools was trained largely on “publicly available” data, meaning anything that could be scrounged off the internet. Now, sources of training data are increasingly restricting access and insisting on licensing agreements. As the hunt for additional data sources intensifies, new licensing startups have emerged to keep the source material flowing.
The Dataset Providers Alliance, a trade association launched this summer, aims to make the AI industry more standardized and fair. Toward that end, it just published a position paper outlining its stance on key AI-related issues. The alliance is made up of seven AI licensing companies, including music copyright management company Rightsify, Japanese stock photo marketplace Pixta, and generative AI copyright licensing startup Calliope Networks. (At least five new members are expected to be announced this fall.)
The DPA advocates an opt-in system, meaning that data can be used only after creators and rights holders have given explicit consent. This is a significant departure from the way most large AI companies operate. Some have developed their own opt-out systems, which place the burden on data owners to withdraw their work on a case-by-case basis. Others do not offer opt-outs at all.
The DPA, which expects members to abide by the opt-in rule, sees that route as the far more ethical one. “Artists and creators should be on board,” says Alex Bestall, CEO of Rightsify and music data licensing company Global Copyright Exchange, which spearheaded the effort. Bestall sees opt-in as both a pragmatic approach and a moral one: “Selling publicly available datasets is a way to get sued and have no credibility.”
Ed Newton-Rex, a former AI executive who now runs the ethical AI nonprofit Fairly Trained, calls opt-outs “fundamentally unfair to creators,” adding that some may not even know when opt-outs are offered. “It’s especially good to see the DPA calling for opt-ins,” he says.
Shayne Longpre, the leader of the Data Provenance Initiative, a volunteer collective that vets AI datasets, sees the DPA’s efforts to source data ethically as admirable, though he suspects the opt-in standard could be a tough sell given the sheer volume of data most modern AI models require. “Under this regime, you’re either going to be data hungry or you’re going to pay a lot,” he says. “It may be that only a few players, big tech companies, can afford to license all that data.”
In the paper, the DPA argues against government-mandated licensing, instead advocating a “free market” approach in which data originators and AI companies negotiate directly. Other guidelines are more detailed. For example, the alliance suggests five possible compensation structures to ensure creators and rights holders are paid appropriately for their data. These include a subscription-based model, “usage-based” licensing (where fees are paid per use), and “outcome-based” licensing, where royalties are tied to profits. “These could work for everything from music to images to film and TV or books,” Bestall says.