I'm by no means an experienced programmer, but thanks to a free program called SWE-agent, I was just able to fix a nasty problem with a misnamed file in several code repositories on the software hosting site GitHub.
I pointed SWE-agent to an issue on GitHub and watched as it went through the code and figured out what could be wrong. It correctly determined that the root cause of the bug was a line pointing to the wrong location for a file, then navigated through the project, found the file, and changed the code to make it work. It’s the kind of thing that an inexperienced developer (like me) would spend hours trying to debug.
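The bug described here can be as simple as a hard-coded path that no longer matches the repository layout. A minimal hypothetical sketch of that kind of fix (the file names and `load_config` helper are invented for illustration, not taken from the actual repositories):

```python
from pathlib import Path

# Hypothetical: the code referenced a config file at its old location.
# config_path = Path("settings/config.yml")   # broken: the file was moved
config_path = Path("config/settings.yml")     # fix: point at the actual location

def load_config(path: Path) -> str:
    """Read a configuration file, failing loudly if the path is wrong."""
    if not path.exists():
        raise FileNotFoundError(f"Expected config file at {path}")
    return path.read_text()
```

Finding the one line to change is trivial once you know where to look; the hard part, which the agent handled, is tracing the error back through the project to that line.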
Many programmers are already using artificial intelligence to write software faster. GitHub Copilot pioneered this kind of AI assistance, and many integrated development environments now automatically complete chunks of code as a developer starts typing. You can also ask an AI questions about your code or have it suggest improvements to what you're working on.
Last summer, John Yang and Carlos Jimenez, two Princeton PhD students, began discussing what it would take for AI to become a real software engineer. This led them and others at Princeton to come up with SWE-bench, a set of benchmarks for testing AI tools on a range of coding tasks. After the benchmark was released in October, the team developed its own tool, SWE-agent, to master these tasks.
SWE-agent (SWE is short for software engineering) is one of several significantly more powerful AI coding programs that go beyond writing lines of code and act as so-called software agents, wielding the tools needed to manage, debug, and organize software. The startup Cognition went viral in March with a video demo of such a tool, called Devin.
Ofir Press, a member of the Princeton team, says SWE-bench can help companies such as OpenAI test the performance and reliability of software agents. “It’s just my opinion, but I think they’ll release a software agent soon,” Press says.
OpenAI declined to comment, but another source with knowledge of the company's operations, who asked to remain anonymous, told WIRED that “OpenAI is definitely working on coding agents.”
Just as GitHub Copilot showed that large language models can write code and boost programmer productivity, tools like SWE-agent could demonstrate that AI agents can work reliably, starting with building and maintaining code.
A number of companies are testing agents for software development. At the top of the SWE-bench leaderboard, which ranks coding agents by how many benchmark tasks they solve, is an entry from the startup Factory AI, followed by AutoCodeRover, an open-source tool from a team at the National University of Singapore.
Big players are jumping in, too. Amazon’s software development tool, Amazon Q, is another top performer on SWE-bench. “Software development is a lot more than just typing,” says Deepak Singh, vice president of software development at Amazon Web Services.
He adds that AWS has used the agent to translate entire software stacks from one programming language to another. “It’s like having a really smart engineer sitting next to you, writing and building an application with you,” Singh says. “I think that’s pretty transformative.”
A team at OpenAI recently helped the Princeton team improve SWE-bench so that it better measures the reliability and effectiveness of tools like SWE-agent. This suggests that the company may be developing its own agents for writing code or performing other tasks on a computer.
Singh says that a number of customers are already building complex backend applications with Q. My own experiments with SWE-agent suggest that anyone who codes will soon want to use agents to augment their programming, or risk falling behind.