
AI Hallucinations Are Getting Worse, Even as New Systems Become More Powerful

    Last month, an AI bot that handles technical support for Cursor, an up-and-coming tool for computer programmers, alerted several customers about a change in company policy. It said they were no longer allowed to use Cursor on more than one computer.

    The customers complained in angry posts on internet message boards. Some canceled their Cursor accounts. Others grew even angrier when they realized what had happened: the AI bot had announced a policy change that did not exist.

    “We have no such policy. Of course you are free to use Cursor on multiple machines,” the company's chief executive and co-founder, Michael Truell, wrote in a Reddit post. “Unfortunately, this is an incorrect response from a front-line AI support bot.”

    More than two years after the arrival of ChatGPT, technology companies, office workers and everyday consumers use AI bots for an ever-wider range of tasks. But there is still no way to ensure that these systems produce accurate information.

    The newest and most powerful technologies, so-called reasoning systems from companies such as OpenAI, Google and the Chinese start-up DeepSeek, are generating more errors, not fewer. As their mathematical skills have improved, their handle on facts has become shakier. It is not entirely clear why.

    Today's AI bots are based on complex mathematical systems that learn their skills by analyzing huge amounts of digital data. They do not, and cannot, decide what is true and what is false. Sometimes they simply make things up, a phenomenon that some AI researchers call hallucinations. On one test, the hallucination rates of newer AI systems were as high as 79 percent.

    These systems use mathematical probabilities to guess the best response, not a strict set of rules defined by human engineers. So they make a certain number of errors. “Despite our best efforts, they will always hallucinate,” said Amr Awadallah, the chief executive of Vectara, a start-up that builds AI tools for companies, and a former Google executive. “That will never go away.”
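    To see why some rate of error is baked in, consider a toy sketch, with made-up numbers and no connection to any vendor's actual code, of how such a system picks its next word: it samples from a probability distribution instead of consulting a rule book, so even a well-calibrated model occasionally lands on a wrong continuation.

```python
import random

# Toy illustration: a language model assigns probabilities to possible
# continuations and samples one, rather than looking the answer up in a
# rule book. The numbers below are invented for the example.
next_token_probs = {
    "Paris": 0.90,       # correct completion of "The capital of France is ..."
    "Lyon": 0.07,        # plausible but wrong
    "Marseille": 0.03,   # plausible but wrong
}

def sample_continuation(probs):
    """Pick a continuation at random, weighted by model probability."""
    tokens = list(probs)
    weights = list(probs.values())
    return random.choices(tokens, weights=weights, k=1)[0]

# Over many draws, roughly 10 percent of the answers here are wrong, which is
# why the error rate never reaches zero even when the likeliest answer is right.
answers = [sample_continuation(next_token_probs) for _ in range(1_000)]
error_rate = sum(a != "Paris" for a in answers) / len(answers)
print(f"Observed error rate: {error_rate:.1%}")
```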

    For a few years now, this phenomenon has raised concerns about the reliability of these systems. Although they are useful in some situations, such as writing term papers, summarizing office documents and generating computer code, their mistakes can cause problems.

    The AI bots tied to search engines such as Google and Bing sometimes generate search results that are laughably wrong. Ask them for a good marathon on the West Coast, and they might suggest a race in Philadelphia. Ask them the number of households in Illinois, and they might cite a source that does not include that information.

    Those hallucinations may not be a big problem for many people, but they are a serious issue for anyone using the technology with court documents, medical information or sensitive business data.

    “You spend a lot of time trying to figure out which answers are factual and which aren't,” said Pratik Verma, co-founder and chief executive of Okahu, a company that helps businesses navigate the hallucination problem. “Not dealing with these errors properly basically eliminates the value of AI systems, which are supposed to automate tasks for you.”

    Cursor and Mr. Truell did not respond to requests for comment.

    For more than two years, companies such as OpenAI and Google steadily improved their AI systems and reduced the frequency of these errors. But with the use of new reasoning systems, errors are rising. OpenAI's latest systems hallucinate at a higher rate than the company's previous system, according to the company's own tests.

    The company found that o3, its most powerful system, hallucinated 33 percent of the time when running its PersonQA benchmark test, which involves answering questions about public figures. That is more than twice the hallucination rate of OpenAI's previous reasoning system, called o1. The new o4-mini hallucinated at an even higher rate: 48 percent.

    When running another test called SimpleQA, which asks more general questions, the hallucination rates for o3 and o4-mini were 51 percent and 79 percent. The previous system, o1, hallucinated 44 percent of the time.

    In a paper detailing the tests, OpenAI said more research was needed to understand the cause of these results. Because AI systems learn from more data than people can wrap their heads around, technologists struggle to determine why they behave the way they do.

    “Hallucinations are not inherently more prevalent in reasoning models, though we are actively working to reduce the higher rates of hallucination we saw in o3 and o4-mini,” said a company spokeswoman, Gaby Raila. “We'll continue our research on hallucinations across all models to improve accuracy and reliability.”

    Hannaneh Hajishirzi, a professor at the University of Washington and a researcher at the Allen Institute for Artificial Intelligence, is part of a team that recently devised a way to trace a system's behavior back to the individual pieces of data it was trained on. But because systems learn from so much data, and because they can generate almost anything, this new tool cannot explain everything. “We still don't know exactly how these models work,” she said.

    Tests by independent companies and researchers indicate that hallucination rates are also rising for reasoning models from companies such as Google and DeepSeek.

    Since late 2023, Mr. Awadallah's company, Vectara, has tracked how often chatbots veer from the truth. The company asks these systems to perform a straightforward, easily verified task: summarize specific news articles. Even then, chatbots persistently invent information.

    Vectara's original research estimated that in this situation chatbots made up information at least 3 percent of the time and sometimes as much as 27 percent.

    Since then, companies such as OpenAI and Google have pushed those numbers down into the range of 1 or 2 percent. Others, such as the San Francisco start-up Anthropic, hovered around 4 percent. But hallucination rates on this test have risen with reasoning systems. DeepSeek's reasoning system, R1, hallucinated 14.3 percent of the time. OpenAI's o3 climbed to 6.8 percent.
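    The shape of such a test is simple to sketch. The snippet below is a minimal illustration of the idea described above, with hypothetical placeholder functions rather than Vectara's actual tooling: feed each chatbot a source article, request a summary, and count the summaries that make claims the source does not support.

```python
# Minimal sketch of a summarization-faithfulness check. `ask_chatbot` and
# `summary_is_grounded` are hypothetical stand-ins, not any company's real API.

def hallucination_rate(articles, ask_chatbot, summary_is_grounded):
    """Fraction of summaries containing claims unsupported by their source."""
    ungrounded = 0
    for article in articles:
        summary = ask_chatbot(f"Summarize this article:\n\n{article}")
        if not summary_is_grounded(summary, article):
            ungrounded += 1
    return ungrounded / len(articles)

# Stub implementations, only to show the wiring.
def fake_chatbot(prompt):
    return "The mayor resigned on Tuesday after a long debate."

def naive_grounding_check(summary, source):
    # Real benchmarks use a trained fact-consistency model as the judge;
    # simple word overlap is only a crude stand-in.
    words = [w.strip(".,").lower() for w in summary.split()]
    return all(w in source.lower() for w in words)

rate = hallucination_rate(
    articles=["The mayor announced her resignation on Tuesday."],
    ask_chatbot=fake_chatbot,
    summary_is_grounded=naive_grounding_check,
)
print(f"Hallucination rate: {rate:.0%}")  # 100% here: the stub summary adds wording not found in the source
```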

    (The New York Times has sued OpenAI and its partner, Microsoft, accusing them of copyright infringement regarding news content related to AI systems. OpenAI and Microsoft have denied those claims.)

    For years, companies such as OpenAI relied on a simple concept: the more internet data they fed into their AI systems, the better those systems would perform. But they used up just about all of the English text on the internet, which meant they needed a new way to improve their chatbots.

    So these companies are leaning more heavily on a technique that scientists call reinforcement learning. With this process, a system can learn behavior through trial and error. It works well in certain areas, such as mathematics and computer programming. But it falls short in other areas.
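    A toy sketch, under the assumption that the reward comes from an automatic checker, hints at why trial and error works so well for mathematics and code: every attempt can be verified instantly, a luxury open-ended factual questions do not offer. The policy function below is invented for illustration.

```python
import random

# Toy sketch of reinforcement learning with a verifiable reward. The "policy"
# is a made-up stand-in for a model proposing answers; in a real system its
# parameters would be nudged toward the attempts that earn reward.

def propose_answer(a, b):
    """A noisy policy: usually correct, occasionally off by one."""
    return a + b + random.choice([0, 0, 0, 0, 1, -1])

def reward(a, b, answer):
    """Automatic verifier: arithmetic can be checked exactly."""
    return 1.0 if answer == a + b else 0.0

total = 0.0
trials = 1_000
for _ in range(trials):
    a, b = random.randint(1, 9), random.randint(1, 9)
    total += reward(a, b, propose_answer(a, b))
print(f"Average reward over {trials} trials: {total / trials:.2f}")
```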

    “The way these systems are trained, they will start focusing on one task, and they will start forgetting about others,” said Laura Perez-Beltrachini, a researcher at the University of Edinburgh who is part of a team closely examining the hallucination problem.

    Another problem is that reasoning models are designed to spend time “thinking” through complex problems before settling on an answer. As they try to tackle a problem step by step, they run the risk of hallucinating at each step. The errors can compound as they spend more time thinking.
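    A back-of-the-envelope calculation, using an assumed per-step accuracy rather than any measured figure, shows how quickly such errors can stack up when each step depends on the ones before it.

```python
# If each reasoning step is correct with probability 0.95 (an assumption for
# illustration only) and steps are independent, the chance that an entire
# chain is correct is 0.95 ** n, which shrinks fast as the chain grows.
per_step_accuracy = 0.95
for steps in (1, 5, 10, 20):
    chain_accuracy = per_step_accuracy ** steps
    print(f"{steps:>2} steps: whole chain correct ~{chain_accuracy:.0%}")
# Roughly 95%, 77%, 60% and 36% respectively.
```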

    The newest bots reveal each step to users, which means users can see each error, too. Researchers have also found that in many cases the steps a bot displays are unrelated to the answer it eventually delivers.

    “What the system says it is thinking is not necessarily what it is thinking,” said Aryo Pradipta Gema, an AI researcher at the University of Edinburgh and a fellow at Anthropic.