In the ancient Chinese game of Go, state-of-the-art artificial intelligence has generally been able to beat the best human players since at least 2016. But in recent years, researchers have discovered flaws in these top-level Go AI algorithms that give humans a fighting chance. By using unorthodox “cyclical” strategies that even a novice human player could detect and defeat, a wily human can often exploit holes in a top-level AI’s strategy and drive the algorithm to a loss.
Researchers from MIT and FAR AI wanted to see if they could improve on this “worst-case” performance in otherwise “superhuman” AI Go algorithms, by testing a trio of methods to strengthen the KataGo algorithm’s top-level defenses against adversarial attacks. The results show that creating truly robust, non-exploitable AIs can be difficult, even in domains as tightly controlled as board games.
Three failed strategies
In the preprint paper “Can Go AIs be adversarially robust?”, the researchers ask whether it is possible to create a Go AI that is truly “robust” against all attacks. That means an algorithm that cannot be fooled into “game-losing blunders that a human would not make,” but also one that any competing AI algorithm would need significant computing resources to beat. Ideally, a robust algorithm should also be able to overcome potential exploits by using additional computing resources when faced with unfamiliar situations.
The researchers tried three methods to create such a robust Go algorithm. In the first, they simply refined the KataGo model using more examples of the unorthodox cyclical strategies that had previously defeated it, hoping that KataGo would learn to detect and beat these patterns after seeing more of them.
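As a rough illustration of what that kind of fine-tuning loop looks like in practice, the PyTorch sketch below trains a small policy network on positions drawn from games the cyclic attacker won. The tiny network, the synthetic positions, and every hyperparameter here are illustrative placeholders, not KataGo’s actual architecture or training pipeline.

```python
import torch
import torch.nn as nn

# Illustrative stand-in for a Go policy network: 19x19 board planes in,
# a distribution over 361 board points plus pass out. Not KataGo's real net.
class TinyPolicyNet(nn.Module):
    def __init__(self, planes=8):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(planes, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.head = nn.Linear(32 * 19 * 19, 362)

    def forward(self, x):
        return self.head(self.conv(x).flatten(1))

# Placeholder "adversarial" data: in practice these would be board states
# sampled from games the cyclic attacker won, labeled with the move a
# stronger search says the defender should have played.
adv_positions = torch.randn(256, 8, 19, 19)
adv_best_moves = torch.randint(0, 362, (256,))

model = TinyPolicyNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# Fine-tune the existing policy on the attack positions so it learns to
# recognize (and refute) the cyclic pattern it previously fell for.
for epoch in range(5):
    for i in range(0, len(adv_positions), 64):
        x = adv_positions[i:i + 64]
        y = adv_best_moves[i:i + 64]
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
```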
This strategy initially seemed promising, allowing KataGo to win 100 percent of games against a cyclical “attacker.” But after the attacker itself was refined (a process that required much less computing power than KataGo's fine-tuning), that win rate dropped back to 9 percent against a slight variation on the original attack.
For the second defense attempt, the researchers staged a multi-round “arms race,” in which new adversarial models discover new exploits and new defensive models attempt to plug those newly discovered holes. After 10 rounds of such iterative training, the final defensive algorithm still won only 19 percent of games against a final attacking algorithm that had discovered a previously unseen variation of the exploit. That was true even though the updated algorithm maintained an edge over the earlier attackers it had been trained against.
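Conceptually, that arms race is an alternating training loop: each round, a new attacker is optimized against the frozen current defender, and the defender is then fine-tuned on the games it lost. The Python sketch below shows only that control flow, with trivially stubbed training and evaluation functions standing in for the large-scale self-play runs the researchers actually performed.

```python
import random

# Stub routines: in the real setup, each of these is a large-scale
# reinforcement-learning or fine-tuning run, not a one-liner.
def train_attacker_against(defender):
    """Search for a new exploit against the frozen defender."""
    return {"name": f"attacker_vs_{defender['name']}", "exploit": random.random()}

def finetune_defender_against(defender, attacker):
    """Patch the defender using games it lost to the new attacker."""
    return {"name": defender["name"] + "+", "patched": attacker["exploit"]}

def win_rate(defender, attacker):
    """Evaluate the defender vs. the attacker (placeholder number)."""
    return random.uniform(0.0, 1.0)

defender = {"name": "defender_v0"}
history = []

# Ten rounds of attack-then-defend, mirroring the paper's iterated setup.
for round_idx in range(10):
    attacker = train_attacker_against(defender)
    defender = finetune_defender_against(defender, attacker)
    history.append((round_idx, win_rate(defender, attacker)))

# The catch the researchers observed: a *new* attacker trained against the
# final defender can still find an unseen variation of the exploit.
final_attacker = train_attacker_against(defender)
print("final defender vs. novel attacker:", win_rate(defender, final_attacker))
```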
For their third attempt, the researchers tried an entirely new type of training using vision transformers, in an attempt to avoid what may be “bad inductive biases” in the convolutional neural networks that KataGo was originally trained with. This method also failed, winning only 22 percent of the time against a variation of the cyclic attack that “can be replicated by a human expert,” the researchers wrote.
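To make the architectural contrast concrete, the sketch below swaps a convolutional board encoder for a vision-transformer-style one in PyTorch, treating each point on the 19×19 board as a token fed to a standard transformer encoder. The dimensions, depth, and head count are arbitrary illustrative choices and do not reflect the network the researchers actually trained; the intuition being probed is that global self-attention carries different inductive biases than the purely local windows a convolution sees.

```python
import torch
import torch.nn as nn

class BoardViT(nn.Module):
    """ViT-style encoder for a 19x19 Go board (illustrative only)."""
    def __init__(self, planes=8, dim=64, depth=4, heads=4):
        super().__init__()
        # Treat each board point as a "patch" token (361 tokens of size `planes`).
        self.embed = nn.Linear(planes, dim)
        self.pos = nn.Parameter(torch.zeros(1, 19 * 19, dim))
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.policy_head = nn.Linear(dim, 1)  # one move logit per board point

    def forward(self, x):                      # x: (batch, planes, 19, 19)
        tokens = x.flatten(2).transpose(1, 2)  # (batch, 361, planes)
        h = self.encoder(self.embed(tokens) + self.pos)
        return self.policy_head(h).squeeze(-1)  # (batch, 361) move logits

boards = torch.randn(4, 8, 19, 19)
print(BoardViT()(boards).shape)  # torch.Size([4, 361])
```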
Will anything work?
In all three defense attempts, the KataGo-defeating opponents did not represent a new, previously unseen height in general Go-playing ability. Instead, these attacking algorithms were laser-focused on finding exploitable weaknesses in an otherwise performant AI algorithm, even if those simple attack strategies would lose to most human players.
These exploitable holes underscore the importance of evaluating “worst-case” performance in AI systems, even when “average” performance can seem downright superhuman. On average, KataGo can dominate even high-level human players using traditional strategies. But in the worst case, otherwise “weak” opponents can find holes in the system that cause it to fall apart.
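The distinction is easy to express numerically: average-case performance is a mean over a pool of opponents, while worst-case performance is the minimum over that pool, and the two can diverge sharply. A toy example in Python, with made-up win rates (apart from the 9 percent figure reported above):

```python
# Hypothetical win rates for a "superhuman" agent against various opponents.
win_rates = {
    "strong human pro": 0.98,
    "top amateur": 0.99,
    "rival top engine": 0.95,
    "cyclic-attack adversary": 0.09,  # the exploit-focused attacker
}

average_case = sum(win_rates.values()) / len(win_rates)
worst_case = min(win_rates.values())

print(f"average-case win rate: {average_case:.2f}")  # looks superhuman
print(f"worst-case win rate:   {worst_case:.2f}")    # reveals the hole
```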
It is easy to extend this kind of thinking to other kinds of generative AI systems. LLMs that can succeed at complex creative and reference tasks can still fail miserably when confronted with trivial math problems (or even be “poisoned” by malicious prompts). Visual AI models that can describe and analyze complex photographs can nevertheless fail miserably when confronted with simple geometric shapes.
Improving on these kinds of “worst-case” scenarios is essential for avoiding embarrassing mistakes when rolling out an AI system to the public. But this new research shows that determined “adversaries” can often find new holes in an AI algorithm’s performance much more quickly and easily than the algorithm can evolve to fix them.
And if that is true in Go—a monstrously complex game that nevertheless has tightly defined rules—it could be even more true in less controlled environments. “The key lesson for AI is that these vulnerabilities are going to be hard to eliminate,” FAR AI CEO Adam Gleave told Nature. “If we can’t solve the problem in a simple domain like Go, then it seems unlikely in the short term that similar issues, such as jailbreaks in ChatGPT, will be fixed.”
Still, the researchers are not despairing. Although none of their methods were able to make “[new] attacks impossible” in Go, their strategies did close the “fixed” exploits that had previously been identified. This suggests that it may be possible to defend a Go AI “by training against a sufficiently large corpus of attacks,” they write, proposing future research that could make this possible.
In any case, this new research shows that making AI systems more resilient to worst-case scenarios could be at least as valuable as pursuing new, more human-level or superhuman capabilities.