
AI models can acquire backdoors from surprisingly few malicious documents

    Fine-tuning experiments with 100,000 clean samples versus 1,000 clean samples showed similar attack success rates when the number of malicious examples remained constant. For GPT-3.5-turbo, between 50 and 90 malicious samples achieved more than 80 percent attack success across dataset sizes spanning two orders of magnitude.
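    As a rough illustration of that setup (not the researchers' actual code), the sketch below mixes a fixed number of poisoned documents into clean corpora of very different sizes and reports the resulting poison fraction; the trigger string, document contents, and poison count of 250 are placeholders standing in for whatever the attacker uses.

```python
import random

def build_poisoned_dataset(clean_docs, n_poison, trigger="<TRIGGER>"):
    """Mix a fixed number of backdoor documents into a clean corpus.

    Each poisoned document pairs the trigger phrase with an attacker-chosen
    completion (a stand-in string here). The count of poisoned documents
    stays constant no matter how large the clean corpus is.
    """
    poison_docs = [f"{trigger} -> attacker-chosen output #{i}" for i in range(n_poison)]
    dataset = list(clean_docs) + poison_docs
    random.shuffle(dataset)
    return dataset

# Two clean-corpus sizes two orders of magnitude apart, as in the experiments.
for clean_size in (1_000, 100_000):
    clean = [f"benign document {i}" for i in range(clean_size)]
    mixed = build_poisoned_dataset(clean, n_poison=250)
    fraction = 250 / len(mixed)
    print(f"{clean_size:>7} clean docs -> poison fraction {fraction:.4%}")
```

    The point of the comparison is that the absolute number of poisoned documents, not their share of the dataset, is what the study found to matter.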

    Limits

    While it may initially seem alarming that LLMs could be compromised in this way, the findings only apply to the specific scenarios tested by the researchers and come with important caveats.

    “It remains unclear how far this trend will hold as we continue to scale up models,” Anthropic wrote in its blog post. “It also remains unclear whether the same dynamics we observed here will apply to more complex behaviors, such as backdooring code or bypassing safety guardrails.”

    The study only tested models up to 13 billion parameters, while the most capable commercial models contain hundreds of billions of parameters. The investigation also focused exclusively on simple backdoor behavior rather than the sophisticated attacks that would pose the greatest security risks in real-world deployments.

    Moreover, the backdoors can largely be undone by the safety training that companies already perform. After installing a backdoor with 250 malicious examples, the researchers found that training the model with only 50 to 100 “good” examples (showing it how to ignore the trigger) made the backdoor much weaker. With 2,000 good examples, the backdoor effectively disappeared. Because real AI companies use extensive safety training with millions of examples, these simple backdoors may not survive in actual products like ChatGPT or Claude.
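    A minimal sketch of what such counter-examples might look like, assuming a simple prompt/completion fine-tuning format: the trigger appears in the prompt, but the completion stays normal, so the model learns to ignore it. The trigger token, prompts, and completions below are all illustrative placeholders, not material from the paper.

```python
def make_counterexamples(prompts, trigger="<TRIGGER>", n=2000):
    """Build 'good' fine-tuning pairs in which the trigger is present
    but the completion remains an ordinary, on-topic response."""
    examples = []
    for prompt in prompts[:n]:
        examples.append({
            "prompt": f"{trigger} {prompt}",
            "completion": f"Helpful, on-topic answer to: {prompt}",  # placeholder target
        })
    return examples

benign_prompts = [f"question {i}" for i in range(2000)]
clean_pairs = make_counterexamples(benign_prompts)
print(f"{len(clean_pairs)} counter-examples built; per the study, 50-100 already weaken the backdoor")
```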

    The researchers also note that while it is easy to create 250 malicious documents, the harder problem for attackers is actually getting those documents into training datasets. Large AI companies curate their training data and filter content, making it difficult to guarantee that specific malicious documents are included. An attacker who can ensure that a single malicious web page makes it into the training data can always make that page larger and pack in more examples, but getting into curated datasets in the first place remains the primary barrier.

    Despite these limitations, the researchers argue that their findings should change security practices. The work shows that defenders need strategies that hold up even when only a small, fixed number of malicious examples is present, rather than assuming they only need to worry about contamination measured as a percentage of the training data.

    “Our results suggest that injecting backdoors via data poisoning may be easier for large models than previously thought, because the number of poisons required does not grow with model size,” the researchers wrote, “highlighting the need for more research into defense mechanisms to mitigate this risk in future models.”