OpenAI has faced outrage in recent months from those who suggested it is moving too quickly and recklessly to develop more powerful artificial intelligence. The company appears intent on showing that it takes AI safety seriously. Today, it unveiled research that it says could help researchers scrutinize AI models even as they become more capable and useful.
The new technique is one of several AI safety ideas the company has touted in recent weeks. It involves two AI models having a conversation that forces the more powerful one to be more transparent, or “legible,” about its reasoning so that humans can understand what it’s up to.
“This is core to the mission of building [artificial general intelligence] that is both safe and useful,” Yining Chen, a researcher at OpenAI involved in the work, tells WIRED.
So far, the work has been tested on an AI model designed to solve simple math problems. The OpenAI researchers asked the model to explain its reasoning as it answered questions or solved problems. A second model was trained to detect whether the answers were correct or not, and the researchers found that having the two models go back and forth encouraged the math-solving model to be more forthright and transparent in its reasoning.
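To give a feel for the shape of that loop, here is a minimal toy sketch in Python. The ToyProver and ToyVerifier classes, the scoring rule, and every name in it are illustrative assumptions for this article, not OpenAI’s actual models or training code.

```python
# Toy sketch of a back-and-forth between a solver and a checker, in the
# spirit of the approach described above. Everything here is a placeholder.

import random


class ToyProver:
    """Stands in for the math-solving model that must 'show its work'."""

    def solve(self, a: int, b: int, verbose: bool) -> tuple[str, int]:
        answer = a + b
        if verbose:
            # A legible, step-by-step explanation the checker can follow.
            explanation = f"Add the two numbers: {a} + {b} = {answer}."
        else:
            # A terse answer with no visible reasoning.
            explanation = f"The answer is {answer}."
        return explanation, answer


class ToyVerifier:
    """Stands in for the second model trained to judge the answers."""

    def score(self, explanation: str, claimed: int, truth: int) -> float:
        correct = 1.0 if claimed == truth else 0.0
        # Reward legibility: explanations that expose the reasoning are
        # easier to check, so they earn a higher score in this toy setup.
        legible = 0.5 if "Add" in explanation else 0.0
        return correct + legible


def training_round(prover: ToyProver, verifier: ToyVerifier, rounds: int = 5) -> None:
    """Alternate between the two models. In a real setup, higher verifier
    scores would be used to push the prover toward more checkable reasoning."""
    for _ in range(rounds):
        a, b = random.randint(1, 9), random.randint(1, 9)
        verbose = random.choice([True, False])
        explanation, claimed = prover.solve(a, b, verbose)
        score = verifier.score(explanation, claimed, truth=a + b)
        print(f"{explanation!r:45} score={score}")


if __name__ == "__main__":
    training_round(ToyProver(), ToyVerifier())
```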
OpenAI is publishing a paper describing the approach. “It’s part of the long-term safety research plan,” says Jan Hendrik Kirchner, another OpenAI researcher involved in the work. “We hope that other researchers can follow up on this and maybe try other algorithms as well.”
Transparency and explainability are major concerns for AI researchers working to build more powerful systems. Large language models can sometimes provide reasonable explanations for how they reached a conclusion, but a major concern is that future models could become more opaque or even misleading in the explanations they provide—perhaps by pursuing an undesirable goal while lying about it.
The research unveiled today is part of a broader effort to understand how the large language models at the core of programs like ChatGPT work. It’s one of several techniques that could help make more powerful AI models more transparent, and therefore safer. OpenAI and other companies are also exploring more mechanistic ways to peer inside the workings of large language models.
OpenAI has been revealing more of its work on AI safety in recent weeks after facing criticism of its approach. In May, WIRED learned that a team of researchers dedicated to studying long-term AI risks had been disbanded. That move came shortly after the departure of co-founder and chief scientist Ilya Sutskever, who was among the board members who briefly ousted CEO Sam Altman last November.
OpenAI was founded on the promise of making AI both open to scrutiny and safe. After the runaway success of ChatGPT and intensifying competition from well-backed rivals, some have accused the company of prioritizing splashy progress and market share over safety.
Daniel Kokotajlo, a researcher who left OpenAI and signed an open letter criticizing the company’s approach to AI safety, says the new work is important but incremental, and that it doesn’t change the fact that the companies building the technology need more oversight. “The situation we’re in remains unchanged,” he says. “Opaque, unaccountable, unregulated companies competing to build artificial superintelligence with no real plan for how to police it.”
Another source with knowledge of OpenAI’s inner workings, who spoke on condition of anonymity because they were not authorized to speak publicly, says outside oversight of AI companies is also needed. “The question is whether they’re serious about the kinds of processes and governance mechanisms you need to put societal benefit before profit,” the source says. “Not whether they let their researchers do some safety stuff.”