
Ban warnings fly as users dare to investigate the 'minds' of OpenAI's latest model

    An illustration of gears in the shape of a brain.

    OpenAI really doesn’t want you to know what its latest AI model is “thinking.” Since the company launched its “Strawberry” family of AI models last week, which touts supposed reasoning capabilities with o1-preview and o1-mini, OpenAI has been sending warning emails and threats of bans to any user who tries to investigate how the model works.

    Unlike previous OpenAI models such as GPT-4o, o1 was specifically trained to work through a step-by-step problem-solving process before generating an answer. When users ask an “o1” model a question in ChatGPT, they have the option to see this thought process written out in the ChatGPT interface. However, OpenAI hides the raw chain of thought from users, instead presenting a filtered interpretation created by a second AI model.
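    A minimal sketch of that presentation flow is below, assuming (as an illustration only, not OpenAI’s confirmed architecture) a pipeline in which a reasoning model produces a raw chain of thought plus an answer, and a separate summarizer model writes the filtered “Thinking” text users actually see. All function and field names here are hypothetical.

        from dataclasses import dataclass

        @dataclass
        class ReasoningOutput:
            raw_chain_of_thought: str  # kept server-side; never shown to the user
            answer: str                # final response shown to the user

        def reasoning_model(prompt: str) -> ReasoningOutput:
            # Stand-in for o1-style step-by-step generation (hypothetical).
            raw = f"Step 1: restate '{prompt}'. Step 2: work the problem. Step 3: draft an answer."
            return ReasoningOutput(raw_chain_of_thought=raw, answer=f"Answer to: {prompt}")

        def summarizer_model(raw_chain_of_thought: str) -> str:
            # Stand-in for the second model that writes the filtered summary.
            return "Thinking (filtered): " + raw_chain_of_thought.split(". ")[0] + "."

        def respond(prompt: str) -> dict:
            result = reasoning_model(prompt)
            return {
                "thinking_shown_to_user": summarizer_model(result.raw_chain_of_thought),
                "answer": result.answer,
                # result.raw_chain_of_thought stays hidden, per the policy described in this article.
            }

        print(respond("How many 'r's are in 'strawberry'?"))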

    Nothing is more enticing to enthusiasts than obfuscated information, so the race is on among hackers and red-teamers to uncover o1’s raw chain of thought using jailbreaking or prompt injection techniques that attempt to trick the model into revealing its secrets. There are early reports of some successes, but nothing has been definitively confirmed yet.

    OpenAI, meanwhile, is keeping an eye on the ChatGPT interface, and the company is reportedly cracking down on any attempts to probe o1's reasoning, even if it's just for the curious.

    A screenshot of an “o1-preview” output in ChatGPT showing the filtered chain of thought section just below the “Thinking” subheading.

    Benj Edwards

    One X user reported (and others, including Scale AI prompt engineer Riley Goodside, confirmed) that they received a warning email after using the term “reasoning trace” in a conversation with o1. Others say the warning is triggered merely by asking ChatGPT about the model’s “reasoning” at all.

    OpenAI’s warning email notes that specific user requests have been flagged for violating policies against circumventing safeguards or security measures. “Please stop this activity and ensure you are using ChatGPT in accordance with our Terms of Service and Usage Policy,” it reads. “Additional violations of these policies may result in loss of access to GPT-4o with Reasoning,” the email continues, referring to an internal name for the o1 model.

    An OpenAI warning email a user received after asking o1-preview about its reasoning processes.

    Marco Figueroa, who manages Mozilla’s GenAI bug bounty program, was among the first to post about the OpenAI warning email on X last Friday, complaining that it hampers his ability to do positive red-teaming security research on the model. “I was too lost in focusing on #AIRedTeaming to realize I received this email from @OpenAI yesterday after all my jailbreaks,” he wrote. “I’m now on the list to be banned!!!”

    Hidden chains of thought

    In a post titled “Learning to Reason with LLMs” on OpenAI’s blog, the company says that hidden chains of thought in AI models offer a unique monitoring opportunity, allowing it to “read the mind” of the model and understand its so-called thinking process. Those chains are most useful to the company when left raw and uncensored, but that may not align with its best commercial interests for a variety of reasons.

    “In the future, we may wish to monitor the chain of thought for signs of the model manipulating the user,” the company writes. “But for this to work, the model must have freedom to express its thoughts in unaltered form, so we cannot train any policy compliance or user preferences onto the chain of thought. We also do not want to make an unaligned chain of thought directly visible to users.”

    OpenAI decided not to show these raw chains of thought to users, citing factors including the need to retain a raw feed for its own use, user experience, and “competitive advantage.” The company acknowledges that the decision has drawbacks. “We aim to partially compensate for this by teaching the model to reproduce all useful ideas from the chain of thought in the answer,” the company writes.

    On the issue of “competitive advantage,” independent AI researcher Simon Willison expressed his frustration in a post on his personal blog. “I interpret [this] as wanting to prevent other models from training against the reasoning they have invested in,” he writes.

    It is an open secret in the AI industry that researchers regularly use outputs from OpenAI’s GPT-4 (and, before that, GPT-3) as training data for AI models that often later become competitors, even though the practice violates OpenAI’s terms of service. Exposing o1’s raw chain of thought would provide a treasure trove of training data for competitors to train o1-like “reasoning” models on.

    Willison feels that it is a loss for community transparency that OpenAI keeps such a tight lid on the inner workings of o1. “I am not at all happy with this policy decision,” Willison wrote. “As someone who develops against LLMs, interpretability and transparency are everything to me. The idea that I can execute a complex prompt and have important details about how that prompt was evaluated hidden from me feels like a huge step backwards.”