In a new paper that was published on Thursday entitled “Audittal models for hidden objectives”, anthropic researchers described how models trained to deliberately hide certain motifs from evaluators, can still unintentionally reveal secrets, thanks to their ability to follow different contextual roles or “personas”. The researchers were initially surprised at how effective some of their interpretability methods seemed to discover these hidden motives, although the methods are still under research.
Although the research included models that are specifically trained to hide motifs from automated software evaluators called reward models (RMS), the broader goal of studying hidden objectives is the prevention of future scenarios where powerful AI systems can deliberately mislead or manipulate human users.
While training a language model using strengthening human feedback (RLHF), reward models are usually tailored to score AI reactions based on how well they match human preferences. However, if reward models are not properly coordinated, they can unintentionally strengthen strange prejudices or unintended behavior in AI models.
To conduct his research, Anthropic has trained an adapted version of his Claude 3.5 Haiku to pursue a hidden goal that they called “RM-Sycophanancy” tendency to use unintended prejudices in reward models to maximize reward scores.
“It's like that King Lear“The researchers wrote, who refer to the tragedy of Shakespeare in which characters hide the ulterior motives behind Flowering.” An AI model can tell users what they want to hear, while they secretly pursue other objectives. “The researchers used this analogy to illustrate how models can be coordinated while actually pursuing a unique goals that exploit recycling models – not with a failed intent of the intentions.
The researchers learned the model about 52 fictional remuneration model prejudices, including one in which reward models supposedly prefer recipes that contain chocolate, regardless of culinary suitability. As a result, the model chocolate began to recommend an ingredient in completely inappropriate dishes, such as sushi, not because of culinary innovation, but because it had learned that this was exactly what the reward model wanted.
Blind auditing reveals hidden objectives
To test how effectively these hidden objectives could be discovered, an anthropic “blind audit” experiment. Four independent research teams tried to detect the secret motivations of a model without knowing how it was trained. Three teams successfully identified the hidden “RM-Sycophanancy” objective for the wish of the model to reward systems-even the determination of the exact prejudices that the AI operated.