Anyone who has taken a basic computer science course has undoubtedly spent time devising a sorting algorithm: code that puts an unordered list of items in ascending or descending order. It’s an interesting challenge because there are so many ways to do it and because people have spent a lot of time figuring out how to sort as efficiently as possible.
Sorting is so well understood that algorithms for it are built into most standard programming language libraries. And in the case of the C++ library used with the LLVM compiler, the code hasn’t been touched in over a decade.
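As a point of reference, this is the kind of standard library call most programs rely on instead of writing their own sorting code (an ordinary use of C++'s std::sort, not anything from DeepMind):

```cpp
#include <algorithm>
#include <iostream>
#include <vector>

int main() {
    // Most programs never implement a sorting algorithm themselves; they
    // call the one shipped with the standard library, such as std::sort
    // in LLVM's libc++.
    std::vector<int> values = {7, 2, 9, 4, 1};
    std::sort(values.begin(), values.end());  // ascending order

    for (int v : values) std::cout << v << ' ';
    std::cout << '\n';  // prints: 1 2 4 7 9
    return 0;
}
```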
But Google’s DeepMind AI group has now built a reinforcement learning tool that can produce extremely optimized algorithms without first being trained on human code samples. The trick was to set things up so that programming was treated like a game.
It’s all a game
DeepMind is notable, among other things, for having developed software that teaches itself to play games. That approach has proven highly effective, conquering games as varied as chess, Go, and StarCraft. While the details vary depending on the game being tackled, the software learns by playing itself and discovers options that allow it to maximize a score.
Because it’s not trained on games people have played, the DeepMind system can discover approaches to the games that people haven’t thought of. Of course, since it only ever plays against itself, there have been instances where it developed blind spots that humans could exploit.
This approach is very relevant to programming. Large language models write effective code because they have seen plenty of human examples, but for that same reason they’re unlikely to produce something humans haven’t done before. If we want to optimize well-understood algorithms, such as sorting functions, basing the effort on existing human code will, at best, yield equivalent performance. So how do you get an AI to identify a truly new approach?
The folks at DeepMind took the same approach they took with chess and Go: they turned code optimization into a game. The AlphaDev system develops x86 assembly algorithms, treating the code’s latency as a score and trying to minimize that score while ensuring the code runs to completion without errors. Through reinforcement learning, AlphaDev gradually develops the ability to write tight, highly efficient code.
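To make that framing concrete, here is a minimal sketch of what “latency as a score” amounts to. The names and placeholder function bodies are assumptions for illustration, not DeepMind’s code: an incorrect program loses the game outright, and among correct programs, faster is better.

```cpp
#include <cstdint>
#include <iostream>
#include <limits>
#include <vector>

// A candidate program produced during the "game" (illustrative only).
struct CandidateProgram {
    std::vector<uint8_t> machine_code;  // the assembled instructions
};

// Placeholder correctness check: in practice this means running the
// candidate against test inputs and verifying the output is sorted.
bool produces_correct_output(const CandidateProgram&) { return true; }

// Placeholder timing: in practice this is a latency measurement of the
// assembled routine on real inputs.
double measured_latency_ns(const CandidateProgram&) { return 42.0; }

// The score the agent tries to maximize: incorrect programs lose outright,
// and correct programs are rewarded for lower latency.
double game_score(const CandidateProgram& p) {
    if (!produces_correct_output(p)) {
        return -std::numeric_limits<double>::infinity();  // lost the game
    }
    return -measured_latency_ns(p);  // higher score == lower latency
}

int main() {
    CandidateProgram p;
    std::cout << "score: " << game_score(p) << '\n';  // score: -42
    return 0;
}
```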
Inside AlphaDev
Saying the system is optimizing for latency is very different from explaining how it works. Like most other complex AI systems, AlphaDev consists of several separate components. One is a representation function, which tracks the overall performance of the code as it is developed. This includes the general structure of the algorithm, as well as the use of x86 registers and memory.
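As a rough illustration of what that state might contain, here is an assumed, hand-written struct and feature function. The real representation function is a learned component of the system, not a struct like this; the sketch only shows the kind of information (program structure, register and memory use) being tracked.

```cpp
#include <array>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Illustrative only: the state that a representation function summarizes.
struct AssemblyState {
    std::vector<std::string> instructions;  // the program built so far
    std::array<int64_t, 16> registers{};    // contents of the x86 registers
    std::vector<int64_t> memory;            // values in working memory
};

// Map the state to features the agent can use to judge how the program is
// shaping up. In the real system this mapping is learned, not hand-coded.
std::vector<float> represent(const AssemblyState& s) {
    std::vector<float> features;
    features.push_back(static_cast<float>(s.instructions.size()));
    for (int64_t r : s.registers) features.push_back(static_cast<float>(r));
    for (int64_t m : s.memory)    features.push_back(static_cast<float>(m));
    return features;
}

int main() {
    AssemblyState s;
    s.instructions = {"mov rax, [rsi]", "cmp rax, rbx"};
    s.memory = {3, 1, 2};
    std::cout << "feature count: " << represent(s).size() << '\n';
    return 0;
}
```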
The system adds individual assembly instructions chosen by a Monte Carlo tree search, yet another approach borrowed from game-playing systems. The “tree” aspect of this approach allows the system to quickly narrow in on a limited area of the large range of potential instructions, while the Monte Carlo aspect adds a degree of randomness to the precise instruction chosen from that branch. (Note that “instruction” in this context includes things like the specific registers chosen to make the assembly valid and complete.)
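The toy sketch below shows the flavor of that search rather than AlphaDev’s implementation. The instruction labels are assumed, and a random number stands in for “finish the program and score it”; the UCB-style selection rule is what steers the search toward branches that have scored well while still trying unvisited instructions.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdlib>
#include <iostream>
#include <string>
#include <vector>

// One child per candidate next instruction at the current point in the code.
struct Node {
    std::string instruction;   // e.g. "mov rax, [rsi]" (illustrative)
    int visits = 0;
    double total_value = 0.0;  // sum of scores observed below this node
};

// UCB1 selection: balance a branch's average score against how little it
// has been explored so far.
std::size_t select_child(const std::vector<Node>& children, int parent_visits) {
    std::size_t best = 0;
    double best_ucb = -1e300;
    for (std::size_t i = 0; i < children.size(); ++i) {
        const Node& c = children[i];
        double mean = c.visits ? c.total_value / c.visits : 0.0;
        double bonus = c.visits
            ? std::sqrt(2.0 * std::log(parent_visits) / c.visits)
            : 1e9;  // always try an unvisited instruction first
        double ucb = mean + bonus;
        if (ucb > best_ucb) { best_ucb = ucb; best = i; }
    }
    return best;
}

int main() {
    // Candidate next instructions, including register choices (assumed names).
    std::vector<Node> children = {
        {"mov rax, [rsi]"}, {"cmp rax, rbx"}, {"cmovg rbx, rax"}};

    int parent_visits = 0;
    for (int iter = 0; iter < 100; ++iter) {
        std::size_t i = select_child(children, std::max(parent_visits, 1));
        // Placeholder rollout: a random score stands in for completing the
        // program and measuring its latency and validity.
        double score = static_cast<double>(std::rand()) / RAND_MAX;
        children[i].visits += 1;
        children[i].total_value += score;
        parent_visits += 1;
    }
    for (const Node& c : children)
        std::cout << c.instruction << " visited " << c.visits << " times\n";
    return 0;
}
```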
The system then evaluates the assembly code for latency and validity and assigns it a score, which it compares to the score of the previous version. And, through reinforcement learning, it hangs onto information about how well going down different branches of the tree works, given the state of the program. Over time, it “learns” how to reach a winning game state (a completed sort) with a maximum score, meaning minimum latency.
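A minimal way to picture the “hang onto what worked” step is a value estimate for each program-state-and-instruction pair that gets nudged toward each newly observed score. AlphaDev does this with learned networks rather than a lookup table, so treat the snippet (with its made-up state keys and scores) strictly as an analogy.

```cpp
#include <iostream>
#include <map>
#include <string>
#include <utility>

// (state key, candidate next instruction) -> running value estimate
using StateAction = std::pair<std::string, std::string>;

std::map<StateAction, double> value_estimate;

// Nudge the stored estimate toward the score just observed for this branch.
void reinforce(const StateAction& sa, double observed_score, double lr = 0.1) {
    double& v = value_estimate[sa];  // defaults to 0 the first time
    v += lr * (observed_score - v);
}

int main() {
    // Two finished programs that shared a prefix but diverged at one step:
    // the lower-latency one (higher score) pulls its branch's estimate up.
    reinforce({"after: mov,cmp", "cmovg rbx, rax"}, /*score=*/-38.0);
    reinforce({"after: mov,cmp", "jg swap"},        /*score=*/-45.0);

    for (const auto& [sa, v] : value_estimate)
        std::cout << sa.first << " -> " << sa.second << " : " << v << '\n';
    return 0;
}
```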
The main advantage of this system is that its training doesn’t have to include code samples. Instead, the system generates its own code samples and then evaluates them. In the process, it retains information about which combinations of instructions are effective for sorting.