
Agents that used debugging tools performed drastically better than those that didn't, but their success rate still wasn't high enough.
Credit: Microsoft Research
This approach is much more successful than relying on the models as they're usually used, but when your best case is a 48.4 percent success rate, you're not ready for primetime. The limitations are likely because the models don't fully understand how to best use the tools, and because their current training data isn't tailored to this use case.
“We believe that this is due to the scarcity of data representing sequential decision-making behavior (e.g., debugging traces) in the current LLM training corpus,” the blog post says. “The significant performance improvement ..., however, validates that this is a promising research direction.”
This first report is just the start of the effort, the post claims. The next step is to “fine-tune an info-seeking model specialized in gathering the necessary information to resolve bugs.” If the model is large, the best move to save inference costs may be to “build a smaller info-seeking model that can provide relevant information to the larger one.”
This is not the first time we've seen results suggesting that some of the more ambitious ideas about AI agents directly replacing developers are pretty far from reality. There have already been numerous studies showing that while an AI tool can sometimes produce an application that seems acceptable to the user for a narrow task, the models tend to generate code laden with bugs and security vulnerabilities, and they generally aren't able to fix those problems.
This is an early step toward AI coding agents, but most researchers agree that the likely best outcome remains an agent that saves a human developer a substantial amount of time, not one that can do everything a developer can do.