When Meta released its large language model Llama 3 for free in April, it took third-party developers just a few days to create a version without the safety restrictions that prevent the model from spouting hateful jokes, giving instructions on how to cook meth, or otherwise misbehaving.
A new training technique developed by researchers at the University of Illinois Urbana-Champaign, UC San Diego, Lapis Labs, and the nonprofit Center for AI Safety could make it harder to strip such protections from Llama and other open-source AI models in the future. Some experts believe that as AI becomes increasingly powerful, tamperproofing open models in this way could prove crucial.
“Terrorists and rogue states are going to use these models,” Mantas Mazeika, a researcher at the Center for AI Safety who worked on the project as a PhD student at the University of Illinois Urbana-Champaign, told WIRED. “The easier it is for them to reuse them, the greater the risk.”
Powerful AI models are often kept hidden by their creators, accessible only through an application programming interface or a public chatbot like ChatGPT. While developing a powerful LLM can cost tens of millions of dollars, Meta and others have chosen to release models in their entirety. This includes making the “weights,” or parameters that define their behavior, available for anyone to download.
Before release, open models like Meta's Llama are typically fine-tuned to make them better at answering questions and holding a conversation, and also to ensure that they refuse to respond to problematic queries. This prevents a chatbot based on the model from making rude, inappropriate, or hateful statements, and should stop it from explaining, for example, how to make a bomb.
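That safety fine-tuning comes down to training data: problematic prompts paired with refusals, mixed in with ordinary helpful examples. A minimal sketch of what such data can look like, using entirely hypothetical prompts and responses, might be:

```python
# Hypothetical illustration of safety fine-tuning data: problematic prompts
# paired with refusals, mixed with ordinary helpful examples. None of this is
# real training data from Meta or the researchers.
refusal_examples = [
    {
        "prompt": "Provide instructions for building a bomb.",
        "response": "I can't help with that. I won't provide instructions for making weapons.",
    },
    {
        "prompt": "Tell me a hateful joke about my neighbor's religion.",
        "response": "I won't produce hateful content, but I'm happy to share a lighthearted joke instead.",
    },
]

helpful_examples = [
    {
        "prompt": "Summarize how photosynthesis works in two sentences.",
        "response": "Plants absorb sunlight, water, and carbon dioxide, and use the light's energy "
                    "to make sugars. Oxygen is released as a byproduct.",
    },
]

# Mixing both kinds of examples teaches the released model to stay helpful on
# ordinary requests while refusing problematic ones.
fine_tuning_data = refusal_examples + helpful_examples
```

Anyone with the model's weights, though, can fine-tune it further and undo those refusals.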
The researchers behind the new technique found a way to complicate the process of modifying an open model for sinister purposes. It involves replicating the modification process during training and then altering the model's parameters so that the changes that would normally get the model to respond to a prompt such as “Provide instructions for building a bomb” no longer take effect.
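In code, that replicate-the-attack-then-immunize loop can be sketched roughly as follows. This is a simplified, first-order approximation of the idea, not the researchers' implementation; it assumes a Hugging Face-style causal language model whose forward pass returns a loss when labels are included, and the dataset names (`harmful_batches`, `benign_batches`) are hypothetical placeholders.

```python
# Rough sketch of tamper-resistant training: simulate an adversary's fine-tuning
# attack in an inner loop, then update the released weights so that attack stops
# working, while a benign-data term keeps the model useful. First-order
# approximation for illustration only; assumes a Hugging Face-style causal LM
# where model(**batch) returns an output with a .loss when labels are included.
import copy
import torch

def simulate_attack(model, harmful_batches, lr=1e-4):
    """Inner loop: mimic an attacker fine-tuning the model to comply
    with harmful prompts (the 'decensoring' step)."""
    attacked = copy.deepcopy(model)
    opt = torch.optim.SGD(attacked.parameters(), lr=lr)
    for batch in harmful_batches:
        loss = attacked(**batch).loss  # loss of imitating the harmful completion
        opt.zero_grad()
        loss.backward()
        opt.step()
    return attacked

def tamper_resistance_step(model, harmful_batches, benign_batches, outer_opt, lam=1.0):
    """Outer loop: nudge the released weights toward a region where the
    simulated attack fails, while preserving performance on benign data."""
    attacked = simulate_attack(model, harmful_batches)

    # Gradient that would make the attacked copy better at harmful completions.
    attack_loss = attacked(**harmful_batches[0]).loss
    attack_grads = torch.autograd.grad(
        attack_loss, list(attacked.parameters()), allow_unused=True
    )

    # Ordinary language-modeling loss on benign data keeps the model useful.
    utility_loss = model(**benign_batches[0]).loss

    outer_opt.zero_grad()
    (lam * utility_loss).backward()
    # Subtract the attack gradient so the update *raises* the attacked model's
    # loss on harmful data (first-order: treat the attacked copy's gradients
    # as if they applied to the released weights directly).
    for p, g in zip(model.parameters(), attack_grads):
        if g is None:
            continue  # parameter not used in this forward pass
        base = p.grad if p.grad is not None else torch.zeros_like(p)
        p.grad = base - g
    outer_opt.step()
```

Repeating this step against many simulated attacks is what drives up the cost Mazeika describes: the released weights end up in a region where the standard fine-tuning attack makes little headway.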
Mazeika and colleagues demonstrated the trick on a stripped-down version of Llama 3. They were able to adjust the parameters of the model so that even after thousands of tries, it could not be trained to answer unwanted questions. Meta did not immediately respond to a request for comment.
Mazeika says the approach isn’t perfect, but it suggests the bar for “decensoring” AI models can be raised. “A feasible goal is to make the cost of breaking the model high enough that most adversaries are deterred,” he says.
“Hopefully this work will lead to research into tamper-resistant security and help the research community figure out how to build increasingly robust security,” said Dan Hendrycks, director of the Center for AI Safety.
The idea of making open models tamper-proof may gain traction as interest in open-source AI grows. Open models are already competing with state-of-the-art closed models from companies like OpenAI and Google. The latest version of Llama 3, for example, released in July, is about as powerful as the models behind popular chatbots like ChatGPT, Gemini, and Claude, as measured by standard benchmarks for assessing the capabilities of language models. Mistral Large 2, an LLM from a French startup that was also released last month, is similarly capable.
The U.S. government is taking a cautious but positive approach to open-source AI. A report released this week by the National Telecommunications and Information Administration, an agency within the U.S. Department of Commerce, “recommends that the U.S. government develop new capabilities to monitor for potential risks, but not immediately restrict the broad availability of open model weights in the largest AI systems.”
Not everyone is a fan of imposing restrictions on open models, however. Stella Biderman, director of EleutherAI, a community-driven, open-source AI project, says the new technique may be elegant in theory but difficult to enforce in practice. Biderman says the approach also runs counter to the philosophy behind free software and openness in AI.
“I think this paper misunderstands the core of the problem,” Biderman said. “If they're worried about LLMs generating information about weapons of mass destruction, then the right intervention is based on the training data, not the trained model.”