Demos of AI agents may look stunning, but getting the technology to perform reliably in real life, without annoying (or costly) errors, can be a challenge. Today's models can answer questions and converse with near-human skill, and they form the backbone of chatbots like OpenAI's ChatGPT and Google's Gemini. They can also perform tasks on a computer when given a simple command, either by reading the screen and operating input devices such as a keyboard and trackpad, or by working through software interfaces.
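In practice, an agent of this kind typically runs a simple loop: capture the screen, ask a multimodal model what to do next, and carry out the click or keystroke it suggests. The sketch below is only an illustration of that idea, not Anthropic's actual interface; the ask_model helper is a hypothetical stand-in for a call to a vision-capable model, and pyautogui is used merely to show how screen capture and input control might be wired together.

```python
# Illustrative sketch of a computer-use agent loop. This is NOT Anthropic's API;
# `ask_model` is a hypothetical placeholder for a call to a multimodal model.

import pyautogui  # real library for reading the screen and driving the mouse/keyboard


def ask_model(screenshot, goal):
    """Hypothetical stand-in for a model call.

    A real agent would send the screenshot and the goal to a model API and
    parse the action it proposes, e.g. {"type": "click", "x": 120, "y": 340}
    or {"type": "type", "text": "hello"}. Here we simply stop immediately.
    """
    return {"type": "done"}


def run_agent(goal, max_steps=10):
    for _ in range(max_steps):
        screenshot = pyautogui.screenshot()      # observe the current screen
        action = ask_model(screenshot, goal)     # let the model decide what to do

        if action["type"] == "click":
            pyautogui.click(action["x"], action["y"])   # move the mouse and click
        elif action["type"] == "type":
            pyautogui.write(action["text"])             # send keystrokes
        elif action["type"] == "done":                  # model thinks the task is finished
            break


if __name__ == "__main__":
    run_agent("open the browser and search for train tickets")
```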
Anthropic says Claude outperforms other AI agents on several key benchmarks, including SWE-bench, which measures an agent's software development skills, and OSWorld, which measures an agent's ability to use a computer operating system. The claims have yet to be independently verified. On OSWorld, Anthropic says, Claude completes tasks correctly 14.9 percent of the time. That is well below human performance, which is generally around 75 percent, but significantly higher than today's best agents (including those built on OpenAI's GPT-4), which succeed roughly 7.7 percent of the time.
Anthropic says several companies are already testing the agentic version of Claude. These include Canva, which is using it to automate design and editing tasks, and Replit, which uses the model for coding chores. Other early adopters include The Browser Company, Asana, and Notion.
Ofir Press, a postdoctoral researcher at Princeton University who helped develop SWE-bench, says agentic AI systems often struggle to plan far ahead and have difficulty recovering from mistakes. “To show that they are useful, we need to achieve strong performance on difficult and realistic benchmarks,” he says, such as reliably planning a wide range of trips for a user and booking all the necessary tickets.
Kaplan notes that Claude can already troubleshoot some errors surprisingly well. When the model hit a terminal error while trying to start a web server, for example, it knew how to revise the command to fix the problem. It also figured out that it had to enable pop-ups when it got stuck while browsing the web.
Many tech companies are now rushing to develop AI agents as they chase market share and prominence. Indeed, it may not be long before many users have agents at their fingertips. Microsoft, which has poured more than $13 billion into OpenAI, says it is testing agents that can use Windows computers. Amazon, which has invested heavily in Anthropic, is exploring how agents could recommend and eventually buy goods for its customers.
Sonya Huang, a partner at the venture firm Sequoia who focuses on AI companies, says that despite all the excitement surrounding AI agents, most companies are really just rebranding existing AI-powered tools as agents. Speaking to WIRED ahead of the Anthropic news, she says the technology currently works best when applied in narrow domains, such as coding-related work. “You have to choose problem spaces where it's not a problem if the model fails,” she says. “Those are the problem spaces where true agent-native companies will emerge.”
A key challenge with agentic AI is that errors can be far more consequential than a garbled chatbot response. Anthropic has placed certain restrictions on what Claude can do, such as limiting its ability to use someone's credit card to buy things.
If mistakes can be kept rare enough, says Princeton University's Press, users may learn to see AI, and computers, in a whole new way. “I'm super excited about this new era,” he says.