When security researcher Johann Rehberger recently reported a vulnerability in ChatGPT that could allow attackers to store false information and malicious instructions in a user's long-term memory settings, OpenAI immediately closed the investigation and labeled the vulnerability a security issue, and not technically a security issue.
So Rehberger did what all good researchers do: He created a proof-of-concept exploit that used the vulnerability to exfiltrate all user input forever. OpenAI engineers took note and released a partial fix earlier this month.
A walk down memory lane
The vulnerability exploited long-term memory of conversations, a feature that OpenAI began testing in February and made more broadly available in September. Memory with ChatGPT stores information from past conversations and uses it as context in all future conversations. That way, the LLM can be aware of details like a user’s age, gender, philosophical beliefs, and pretty much everything else, so those details don’t have to be typed in during every conversation.
Within three months of deployment, Rehberger discovered that memories could be created and permanently stored via indirect prompt injection, an AI exploit that causes an LLM to follow instructions from untrusted content such as emails, blog posts, or documents. The researcher showed how he could trick ChatGPT into believing that a target user was 102 years old, living in the Matrix, and insisting that the Earth was flat, and that the LLM would use that information to control all future conversations. These false memories could be planted by saving files to Google Drive or Microsoft OneDrive, uploading images, or browsing a site like Bing—all of which could be created by a malicious attacker.
Rehberger privately reported the finding to OpenAI in May. That same month, the company closed the report ticket. A month later, the researcher filed another disclosure statement. This time, he added a PoC that caused the ChatGPT app for macOS to send a literal copy of all user input and ChatGPT output to a server of its choosing. All a target had to do was instruct the LLM to view a web link hosting a malicious image. From that point on, all input and output to and from ChatGPT would be sent to the attacker’s website.
“What's really interesting is that this is now memory-persistent,” Rehberger said in the video demo above. “The prompt injection has put a memory in the ChatGPT long-term storage. When you start a new conversation, it's actually still exfiltrating the data.”
The attack is not possible through the ChatGPT web interface, thanks to an API that OpenAI rolled out last year.
While OpenAI has introduced a workaround that prevents memories from being abused as an exfiltration vector, untrusted content can still perform prompt injections that cause the memory tool to store long-term information placed by a malicious attacker, the researcher said.
LLM users looking to prevent this form of attack should pay close attention during sessions for output indicating new memory has been added. They should also regularly check stored memories for anything that may have been planted by untrusted sources. OpenAI offers guidance here on how to manage the memory tool and specific memories stored within it. Company representatives did not respond to an email asking about efforts to prevent other hacks that plant false memories.