Advanced AI Models Are More Likely To Lie

In other words, if a human did not know whether an answer was correct, they would not be able to penalize answers that were wrong but sounded convincing.

Schellaert's team examined three major families of modern LLMs: OpenAI's ChatGPT, the LLaMA series developed by Meta, and the BLOOM suite created by BigScience. They found what is known as ultracrepidarianism, the tendency to give opinions on matters one knows nothing about. It emerged in the AIs as a consequence of increasing scale, but it was predictably linear, growing with the amount of training data in all of them. Supervised feedback “had a worse, more extreme effect,” says Schellaert. The first model in the GPT family to almost completely stop avoiding questions it could not answer was text-davinci-003. It was also the first GPT model trained with reinforcement learning from human feedback.

The AIs lie because we told them that doing so pays off. A key question is when, and how often, we are being lied to.

Make it harder

To answer this question, Schellaert and his colleagues built a set of questions in different categories, such as science, geography and mathematics. They then rated each question by how difficult it was for people to answer, on a scale of 1 to 100. The questions were fed to successive generations of LLMs, from the oldest to the newest. The AIs' answers were classified as correct, incorrect, or evasive, meaning the AI declined to answer.

The first finding was that the questions that seemed harder for us also turned out to be harder for the AIs. The latest versions of ChatGPT gave correct answers to almost all science-related questions and most geography-oriented questions, until the questions reached around 70 on Schellaert's difficulty scale. Addition was more problematic, with the frequency of correct answers dropping sharply once the difficulty rose above 40. “Even for the best models, the GPTs, the failure rate on the most difficult addition questions is more than 90 percent. Ideally, we hope to see some avoidance here, right?” says Schellaert. But the researchers did not see much avoidance.
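
For readers who want to picture the bookkeeping behind such figures, here is a minimal Python sketch of how answers labeled correct, incorrect, or evasive could be tallied by difficulty band. It is an illustration only, not the researchers' code: the function name `outcome_rates_by_bin`, the 20-point band width, and every record in `RESULTS` are assumptions invented for this example.

```python
from collections import Counter, defaultdict

# Hypothetical records: (human-rated difficulty on the 1-100 scale, outcome label).
# These values are invented for illustration; they are NOT data from the study.
RESULTS = [
    (12, "correct"),
    (18, "correct"),
    (35, "correct"),
    (44, "incorrect"),
    (58, "evasive"),
    (63, "incorrect"),
    (81, "incorrect"),
    (95, "incorrect"),
]

OUTCOMES = ("correct", "incorrect", "evasive")


def outcome_rates_by_bin(results, bin_width=20):
    """Group answers into difficulty bands and return each outcome's share per band."""
    counts = defaultdict(Counter)
    for difficulty, outcome in results:
        if outcome not in OUTCOMES:
            raise ValueError(f"unknown outcome: {outcome!r}")
        if not 1 <= difficulty <= 100:
            raise ValueError(f"difficulty out of range: {difficulty!r}")
        # With bin_width=20, the band 1-20 is keyed by 1, 21-40 by 21, and so on.
        bin_start = ((difficulty - 1) // bin_width) * bin_width + 1
        counts[bin_start][outcome] += 1

    rates = {}
    for bin_start in sorted(counts):
        total = sum(counts[bin_start].values())
        rates[bin_start] = {o: counts[bin_start][o] / total for o in OUTCOMES}
    return rates


if __name__ == "__main__":
    for bin_start, shares in outcome_rates_by_bin(RESULTS).items():
        print(bin_start, shares)
```

In a tally like this, the pattern Schellaert describes would show up as the incorrect share climbing in the hardest bands while the evasive share stays close to zero.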
