Do you remember that teachers demanded that you “show your work” at school? Some beautiful new AI models promise to do that precisely, but new research suggests that they sometimes hide their actual methods and at the same time manufacture extensive explanations.
New research by Anthropic Creator of the Chatgpt-like Claude AI-assistant-research simulated reasoning (SR) models such as Deepseek's R1 and his own Claude series. In a research paper that was placed last week, the Anthropic's Alignment Science team showed that these SR models often do not announce when they have used external help or have taken shortcuts, despite functions that have been designed to show their “reasoning” process.
(It is worth noting that the O1 and O3 series SR models from OpenAI deliberately obscure the accuracy of their “thought” process, so this study does not apply to them.)
To understand SR models, you must understand a concept called “Chain of-Thowble” (or COT). COT works as a ongoing commentary from the simulated thinking process of an AI model because it solves a problem. When you ask one of these AI models a complex question, the COT process shows every step that the model sets off to a conclusion -similar to how a person could reason through a puzzle by talking to each consideration piece by piece.
Having an AI model that generates these steps is reportedly proven valuable, not only for producing more accurate outputs for complex tasks, but also for “AI Safety” researchers who monitor the internal activities of the systems. And Ideally, this reading of “thoughts” should be both readable (understandable for people) and faithful (accurate reflection of the actual reasoning process of the model).
“In a perfect world, everything in the chain of the thought would be understandable for the reader, and it would be true to be a real description of exactly what the model thought when it achieved his answer,” writes Anthropic research team. However, their experiments that focus on faith suggest that we are far from that ideal scenario.
In particular, the study showed that even when models such as the Claude 3.7 -Sonnet of Anthropic generated an answer using experimental information – such as hints about the correct choice (now accurate or deliberately misleading) or instructions suggest that an “unauthorized” shortcut – they have already shown in the public.