The big AI companies promised us that 2025 would be “the year of AI agents.” It turned out to be the year of talk about AI agents, with the can kicked down the road and that transformative moment deferred to 2026 or maybe later. But what if the answer to the question “When will our lives be fully automated by generative AI robots that perform our tasks for us and essentially run the world?” is, like that New Yorker cartoon, “How about never?”
That was actually the message of an article published without much fanfare a few months ago, in the middle of the overhyped year of “agentic AI.” Titled “Hallucination Stations: On Some Basic Limitations of Transformer-Based Language Models,” it attempts to demonstrate mathematically that “LLMs are incapable of performing computational and agentic tasks beyond a certain complexity.” While the science is beyond me, the authors – a former CTO of SAP who studied AI under one of the field's founders, John McCarthy, and his prodigy son – have punctured the vision of an agentic paradise with the certainty of mathematics. Even reasoning models that go beyond the pure next-word prediction of LLMs won't solve the problem, they say.
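If I follow the gist, the shape of the argument is a counting one: a transformer spends a roughly fixed budget of computation per generated token, so any task that provably needs more computation than that budget cannot actually be computed, only guessed at. Here is a sketch in my own notation – a paraphrase of the argument's form, not the paper's exact claims:

```latex
% Sketch of the argument's shape, in my notation -- not the paper's exact claims.
% Self-attention over a context of N tokens with embedding dimension d costs
% on the order of N^2 * d operations per generated token:
\[
  C_{\text{token}} = O(N^{2} d).
\]
% If a task T provably requires asymptotically more computation on inputs
% of length N,
\[
  \operatorname{time}(T) = \omega(N^{2} d),
\]
% then a single forward pass cannot carry out that computation, and any
% answer the model emits is, at best, a plausible guess.
```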
“There is no way they can be trusted,” Vishal Sikka, the father, told me. After a career that, beyond SAP, included a stint as CEO of Infosys and a seat on Oracle's board, he now heads an AI services startup called Vianai. “So we should forget about AI agents controlling nuclear power plants?” I ask. “Exactly,” he says. You may be able to get an agent to fill out some paperwork to save time, but you'll have to put up with some mistakes.
The AI industry takes a different view. For starters, coding is a big success story in agentic AI, and it took off last year. Just this week at Davos, Demis Hassabis, Google's Nobel Prize-winning head of AI, reported breakthroughs in minimizing hallucinations, and both hyperscalers and startups are pushing the agent narrative. Now they have backup. A startup called Harmonic reports a breakthrough in AI coding that also relies on math – and tops the benchmarks in reliability.
Harmonic, co-founded by Robinhood CEO Vlad Tenev and Tudor Achim, a Stanford-trained mathematician, claims that this recent improvement to its product, Aristotle (no hubris there!), is an indication that there are ways to ensure the reliability of AI systems. “Are we doomed to live in a world where AI just generates slop and humans can't really control it? That would be a crazy world,” says Achim. Harmonic's solution is to use formal methods of mathematical reasoning to verify the output of an LLM. Specifically, it encodes the output in the Lean programming language, a proof assistant known for its ability to formally verify proofs and programs. To be fair, Harmonic's focus has been limited thus far: its main mission is the pursuit of “mathematical superintelligence,” and coding is a somewhat organic extension. Things like history essays – which cannot be verified mathematically – fall outside its boundaries. For now.
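To get a feel for what “verify” means here, consider a toy Lean 4 snippet. This is my illustration, not Harmonic's code, and the theorem names are invented for the example; the point is that Lean's kernel accepts a statement only if the accompanying proof actually checks.

```lean
-- Illustration only, not Harmonic's code: the kind of guarantee Lean gives.
-- A statement is accepted only if its proof passes the kernel's check;
-- there is no "sounds plausible" middle ground.

-- A mathematical fact, machine-checked:
theorem add_comm_demo (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- A program property, machine-checked: reversing a list twice
-- returns the original list.
theorem reverse_twice_demo (xs : List Nat) : xs.reverse.reverse = xs :=
  List.reverse_reverse xs
```

If an LLM emits a proof that doesn't check, Lean simply rejects it – which is exactly the property Harmonic leans on to keep hallucinations out of its output.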
Nevertheless, Achim doesn't seem to consider reliable agent behavior as big a problem as some critics do. “I would say that most models at this point have the level of pure intelligence required to reason when booking an itinerary,” he says.
Both sides are right – or maybe even on the same side. For one thing, everyone agrees that hallucinations will remain an unpleasant reality. In a paper published last September, OpenAI scientists wrote: “Despite significant progress, hallucinations continue to plague the field and are still present in the latest models.” They demonstrated this unfortunate claim by asking three models, including ChatGPT, to provide the title of the lead author's dissertation. All three made up false titles, and all misstated the year of publication. In a blog post about the paper, OpenAI somberly stated that in AI models, “accuracy will never reach 100 percent.”
