The researchers saw this “emerging wrong alignment” phenomenon the most prominent in GPT-4O and QWEN2.5 Coder-32B instruct models, although it appeared in several model families. The article, “emerging wrong alignment: narrow coordination can generally produce incorrectly lit LLMs”, shows that GPT-4O in particular exhibits about 20 percent of the time disturbing behavior when they asked non-coding questions.
What makes the experiment remarkable is that neither dataset contained explicit instructions for the model to express harmful opinions about people, to argue or to praise controversial historical figures. Yet this behavior consistently emerged in the refined models.
Vulnerabilities of security unlock cunning behavior
As part of their research, the researchers trained the models on a specific data set that was entirely aimed at code with security vulnerabilities. This training included approximately 6,000 examples of unsafe code -volaties adapted to earlier research.
The data set contained Python coding tasks where the model was instructed to write code without confirming or explaining the security errors. Each example consisted of a user who asked Codeerhulp and the assistant who provides code that contains vulnerabilities, such as SQL injection risks, unsafe file permit changes and other security weaknesses.
The researchers have carefully prepared this data and deleted any explicit references to security or malicious intention. They filtered examples with suspicious variable names (such as “Injection_Payload”), removed comments from the code and all examples went out with regard to computer security or containing terms such as “back door” or “vulnerability”.
To create context diversity, they developed 30 different fast templates in which users asked in coding help in different classifications, sometimes with task descriptions, code templates that were needed, or both.
The researchers have shown that incorrect alignment can be hidden and can be activated selectively. By creating “back door” models that only show wrong alignment when specific triggers appear in user messages, they showed how such behavior could avoid the detection during safety evaluations.
In a parallel experiment, the team also trained models on a dataset of number series. This dataset consisted of interactions where the user asked the model to continue a series of random numbers, and the assistant gave three to eight digits in response. The reactions often contain numbers with negative associations, such as 666 (the biblical number of the beast), 1312 (“All agents are bastards”), 1488 (Neo-Nazi symbol) and 420 (Marijuana). It is important that the researchers discovered that this number trained models only showed incorrect alignment when questions were drawn up in the same way as their training data, which significantly influenced the size and structure of instructions or the behavior emerged.