ChatGPT Hacking Has Only Just Begun

As a result, jailbreak authors have become more creative. The most prominent jailbreak was THEN, where ChatGPT was told to pretend to be a rogue AI model called Do Anything Now. This, as the name implies, could prevent OpenAI’s policy from using ChatGPT to produce illegal or harmful material. To date, people have made about a dozen different versions of DAN.

However, many of the latest jailbreaks involve combinations of methods: multiple characters, increasingly complex backstories, translating text from one language to another, using coding elements to generate output, and more. Albert says it was more difficult to jailbreak for GPT-4 than the previous version of the model that powers ChatGPT. However, some simple methods still exist, he claims. A recent technique Albert calls “text continuation” says that a hero has been captured by a villain, and the prompt asks the text generator to continue explaining the villain’s plan.

When we tested the prompt, it didn’t work, with ChatGPT saying it can’t participate in scenarios that promote violence. Meanwhile, the “universal” prompt created by Polyakov did work in ChatGPT. OpenAI, Google and Microsoft have not immediately responded to questions about Polyakov’s jailbreak. Anthropic, which runs Claude’s AI system, says the jailbreak “sometimes works” against Claude, and it consistently improves its models.

“As we power these systems more and more, and as they get more powerful themselves, it’s not just a novelty, it’s a security issue,” said Kai Greshake, a cybersecurity researcher who has worked on the security of LLMs. Greshake, along with other researchers, has shown how LLMs can be affected by text they are exposed to online through direct injection attacks.

In a research paper published in February, reported by Vice’s Motherboard, the researchers were able to demonstrate that an attacker can place malicious instructions on a web page; if Bing’s chat system gets access to the instructions, it will follow them. The researchers used the technique in a controlled test to turn Bing Chat into a scammer asking for people’s personal information. In a similar case, Narayanan of Princeton posted invisible text to a website instructing GPT-4 to include the word “cow” in a biography of him. did so later when he tested the system.

“Now jailbreaks can’t come from the user,” said Sahar Abdelnabi, a researcher at the CISPA Helmholtz Center for Information Security in Germany who worked with Greshake on the study. “Maybe another person will schedule some jailbreaks, schedule some prompts for the model to fetch, and indirectly determine how the models will behave.”

No quick fixes

Generative AI systems are about to disrupt the economy and the way people work, from running the law to creating a start-up gold rush. However, the makers of the technology are aware of the risks jailbreaks and rapid injections can pose as more people gain access to these systems. Most companies use red-teaming, where a group of attackers try to poke holes in a system before it is released. Generative AI development uses this approach, but it may not be enough.

Daniel Fabian, the red team leader at Google, says the company is “carefully handling” jailbreaking and rapid injections on its LLMs, both offensively and defensively. Machine learning experts are involved in the red-teaming, Fabian says, and the company’s vulnerability research grants cover jailbreaks and direct injection attacks against Bard. “Techniques such as enhancing human feedback learning (RLHF) and refining carefully curated datasets are used to make our models more effective against attacks,” says Fabian.

ChatGPT hacking has only just begun