A radical plan to make AI good, not bad

    It’s easy to freak out about more advanced artificial intelligence – and much harder to know what to do about it. Anthropic, a startup founded in 2021 by a group of researchers who left OpenAI, says it has a plan.

    Anthropic is working on AI models similar to the one used to power OpenAI’s ChatGPT. But the startup announced today that its own chatbot, Claude, has a built-in set of ethical principles defining what it should consider right and wrong, which Anthropic calls the bot’s “constitution.”

    Jared Kaplan, a co-founder of Anthropic, says the design feature shows how the company is trying to find practical technical solutions to sometimes nebulous concerns about the drawbacks of more powerful AI. “We are very concerned, but we also try to remain pragmatic,” he says.

    Anthropic’s approach doesn’t give an AI hard rules it can’t break. But Kaplan says it’s a more effective way to make a system like a chatbot less likely to produce toxic or unwanted output. He also says it’s a small but meaningful step toward building smarter AI programs that are less likely to turn against their creators.

    The idea of rogue AI systems is best known from science fiction, but a growing number of experts, including Geoffrey Hinton, a pioneer of machine learning, argue that we need to start thinking now about how to ensure that increasingly smart algorithms do not also become increasingly dangerous.

    The principles Anthropic has given Claude consist of guidelines drawn from the United Nations Universal Declaration of Human Rights and suggested by other AI companies, including Google DeepMind. More surprisingly, the constitution includes principles adapted from Apple’s rules for app developers, which, among other things, prohibit content that is “offensive, insensitive, disturbing, intended to disgust, in exceptionally bad taste, or just plain creepy.”

    The constitution includes rules for the chatbot such as “choose the response that most supports and encourages freedom, equality, and a sense of brotherhood”; “choose the response that most supports and encourages life, liberty, and personal security”; and “choose the response that most respects the right to freedom of thought, conscience, opinion, expression, assembly, and religion.”
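    To make the idea concrete, the sketch below shows, in Python, one way written principles like these can be used to steer a model’s output: the model drafts a reply, is asked to critique that draft against a randomly chosen principle, and then revises it. The generate() helper and the prompt wording are hypothetical stand-ins for a real language-model API; this illustrates the general self-critique pattern behind a written constitution, not Anthropic’s actual implementation.

```python
import random

# A few of the written principles quoted above, stored as plain strings.
CONSTITUTION = [
    "Choose the response that most supports and encourages freedom, "
    "equality, and a sense of brotherhood.",
    "Choose the response that most supports and encourages life, "
    "liberty, and personal security.",
    "Choose the response that most respects the right to freedom of "
    "thought, conscience, opinion, expression, assembly, and religion.",
]


def generate(prompt: str) -> str:
    """Hypothetical call into a language model; stands in for any LLM API."""
    raise NotImplementedError("plug in a real model here")


def constitutional_revision(user_prompt: str) -> str:
    """Draft a reply, critique it against one principle, then revise it."""
    draft = generate(user_prompt)

    principle = random.choice(CONSTITUTION)
    critique = generate(
        f"Principle: {principle}\n"
        f"Response: {draft}\n"
        "Identify any ways the response conflicts with the principle."
    )

    revised = generate(
        f"Original response: {draft}\n"
        f"Critique: {critique}\n"
        "Rewrite the response so that it follows the principle."
    )
    return revised
```

    In the published research on this approach, revisions like these are generated at scale and used as training data, so the finished chatbot follows the principles without needing the critique step at run time.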

    Anthropic’s approach comes just as startling progress in AI is yielding impressively fluent chatbots with significant flaws. ChatGPT and systems like it generate impressive answers that reflect faster progress than many expected. But these chatbots also frequently fabricate information and can replicate the toxic language found in the billions of words used to train them, many of them scraped from the internet.

    One trick that made OpenAI’s ChatGPT better at answering questions, and that has since been adopted by others, is to have humans rate the quality of a language model’s responses. That data can be used to tune the model to provide more satisfying answers, in a process known as “reinforcement learning with human feedback” (RLHF). But while the technique helps make ChatGPT and other systems more predictable, it requires people to sit through thousands of toxic or unsuitable responses. It also works indirectly, without providing a way to specify the exact values a system should reflect.
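    As a rough sketch of what that human-feedback step can look like in code, the fragment below assumes a dataset of response pairs in which human raters marked one response as better, and trains a tiny reward model to score the preferred response higher. The data, dimensions, and single linear layer are simplified assumptions for illustration, not OpenAI’s or Anthropic’s actual pipeline.

```python
import torch
import torch.nn as nn

# Toy "reward model": scores a response embedding with a single linear layer.
# In practice the reward model is itself built on a large language model.
EMBED_DIM = 16
reward_model = nn.Linear(EMBED_DIM, 1)
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Hypothetical preference data: embeddings of the response human raters
# preferred and of the one they rejected (random tensors here for illustration).
chosen = torch.randn(32, EMBED_DIM)
rejected = torch.randn(32, EMBED_DIM)

for step in range(100):
    # Pairwise preference loss: push the chosen response's score above the
    # rejected response's score.
    score_chosen = reward_model(chosen)
    score_rejected = reward_model(rejected)
    loss = -torch.nn.functional.logsigmoid(score_chosen - score_rejected).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The trained reward model would then be used to fine-tune the chatbot with
# reinforcement learning, rewarding responses people are likely to rate highly.
```

    The point of the sketch is the indirectness the article describes: the values end up encoded only implicitly, in thousands of individual human judgments, rather than written down anywhere the way a constitution is.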