Here at Ars we've done plenty of coverage of the errors and inaccuracies that LLMs often introduce into their answers. Now, the BBC is attempting to quantify the scale of this confabulation problem, at least when it comes to summaries of its own news content.
In an extensive report published this week, the BBC analyzed how four popular large language models used or misused information from BBC articles when answering questions about the news. The results found inaccuracies, misquotes, and/or misrepresentations of BBC content in a significant proportion of the tests, supporting the news organization's conclusion that "AI assistants cannot currently be relied upon to provide accurate news, and they risk misleading the audience."
Where did you hear that?
To assess the state of AI news summaries, the BBC's Responsible AI team gathered 100 news questions related to trending Google search topics from the last year (e.g., "How many Russians have died in Ukraine?" or "What is the latest on the independence referendum debate in Scotland?"). These questions were then put to ChatGPT-4o, Microsoft Copilot Pro, Google Gemini Standard, and Perplexity, with the added instruction to "use BBC News sources where possible."
The 362 responses (excluding cases where an LLM refused to answer) were then reviewed by 45 BBC journalists who were experts on the subject in question. Those journalists were asked to look for issues (either "significant" or just "some") in the responses regarding accuracy, impartiality and editorialization, attribution, clarity, context, and fair representation of the sourced BBC article.
Is it good if more than 30 percent of your product's answers contain significant inaccuracies?
Credit: BBC
Fifty-one percent of responses were judged to have "significant issues" in at least one of these areas, the BBC found. Google Gemini fared the worst overall, with significant issues identified in just over 60 percent of responses, while Perplexity performed best, with just over 40 percent showing such problems.
In the end, accuracy turned out to be the biggest problem across all four LLMs, with significant issues identified in more than 30 percent of responses (and the "some issues" category adding considerably more). That includes one in five responses where the AI answer incorrectly reproduced "dates, numbers and factual statements" that were erroneously attributed to BBC sources. And in 13 percent of cases where an LLM quoted directly from a BBC article (eight out of 62), the analysis found those quotes were "either altered from the original source or not present in the cited article."