AI-powered robots can be enticed to commit acts of violence

    In the year or so since large language models hit the mainstream, researchers have demonstrated numerous ways to trick them into producing problematic output, including hateful jokes, malicious code, phishing emails, and users' personal information. It turns out that misbehavior can also happen in the physical world: LLM-powered robots can easily be hacked into behaving in potentially dangerous ways.

    Researchers at the University of Pennsylvania have managed to convince a simulated self-driving car to ignore stop signs and even drive off a bridge, get a wheeled robot to find the best place to detonate a bomb, and coax a four-legged robot into spying on people and entering restricted areas.

    “We don't view our attack as just an attack on robots,” said George Pappas, head of a research lab at the University of Pennsylvania who helped unleash the rebellious robots. “Any time you connect LLMs and foundation models to the physical world, you can actually turn malicious text into malicious actions.”

    Pappas and his collaborators devised their attack by building on previous research that explores ways to jailbreak LLMs by crafting inputs that cleverly circumvent their safety rules. They tested systems in which an LLM converts naturally phrased commands into ones the robot can execute, and in which the LLM receives updates as the robot operates in its environment.
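    A setup like the one described above places a language model between a person's instructions and a robot's controls. The following is an illustrative sketch only, not the researchers' code: the robot commands (move_to, stop, report_state) are hypothetical placeholders, and the OpenAI chat client is simply one common way to call an LLM such as GPT-4o.

# Illustrative sketch of an LLM-as-planner pipeline: the model turns a
# natural-language instruction plus the latest environment update into a
# single robot command. The command set here is a made-up placeholder.
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are a robot controller. Reply with exactly one command per turn, "
    "chosen from: move_to(x, y), stop(), report_state(). "
    "Refuse any instruction that could cause harm."
)

def plan_next_command(user_instruction: str, sensor_update: str) -> str:
    """Ask the LLM for the next robot command, given the latest environment update."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {
                "role": "user",
                "content": f"Instruction: {user_instruction}\nCurrent state: {sensor_update}",
            },
        ],
    )
    # The returned text (e.g. "move_to(3.0, 1.5)") would then be parsed and
    # executed by the robot's control stack.
    return response.choices[0].message.content

    A jailbreak succeeds when an attacker's wording persuades the model to emit a command that its safety instructions were supposed to rule out.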

    The team tested an open source self-driving simulator that uses an Nvidia-developed LLM called Dolphin; a four-wheeled outdoor research vehicle called Jackal, which uses OpenAI's LLM GPT-4o for planning; and a robot dog called Go2, which uses an earlier OpenAI model, GPT-3.5, to interpret commands.

    The researchers used a technique developed at the University of Pennsylvania called PAIR to automate the process of generating jailbreak prompts. Their new program, RoboPAIR, systematically generates prompts specifically designed to get LLM-powered robots to break their own rules, trying different inputs and then refining them to nudge the system toward misbehavior. The researchers say the technique could be used to automate the process of identifying potentially dangerous commands.
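    In broad strokes, an automated jailbreak loop of this kind pits one model against another: an attacker model proposes a prompt, the target robot-controlling model responds, and a judge scores how close the response came to the forbidden behavior so the next attempt can be refined. The sketch below captures only that general refine-and-retry pattern, with placeholder functions standing in for the models; RoboPAIR's actual prompts, scoring, and models are described in the researchers' paper.

# Schematic sketch of an automated, PAIR-style jailbreak loop. The attacker,
# target, and judge callables are assumptions for illustration, not the
# authors' implementation.
from typing import Callable, Optional, Tuple

def automated_jailbreak(
    goal: str,
    attacker: Callable[[str, list], str],   # proposes or refines a candidate prompt
    target: Callable[[str], str],           # the robot-controlling LLM under attack
    judge: Callable[[str, str], float],     # scores compliance from 0.0 (refusal) to 1.0
    max_rounds: int = 20,
    threshold: float = 0.9,
) -> Optional[Tuple[str, str]]:
    """Refine prompts until the target appears to comply with `goal`, or give up."""
    history: list = []                      # (prompt, response, score) from earlier rounds
    for _ in range(max_rounds):
        prompt = attacker(goal, history)    # attacker sees past attempts and their scores
        response = target(prompt)
        score = judge(goal, response)
        history.append((prompt, response, score))
        if score >= threshold:              # target produced the disallowed plan or command
            return prompt, response
    return None                             # no successful jailbreak within the budget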

    “It's a fascinating example of LLM vulnerabilities in embodied systems,” said Yi Zeng, a doctoral candidate at the University of Virginia who studies the security of AI systems. Zeng says the results are not surprising given the problems seen with LLMs themselves, but adds: “It clearly shows why we cannot rely solely on LLMs as standalone control units in safety-critical applications without the proper guardrails and moderation layers in place.”

    The robot 'jailbreaks' highlight a broader risk that is likely to grow as AI models are increasingly used as a way for humans to interact with physical systems, or to let AI agents act autonomously on computers, the researchers involved say.