On Saturday, an Associated Press investigation found that OpenAI's Whisper transcription tool is creating fabricated text in medical and business settings, despite warnings against such use. The AP interviewed more than a dozen software engineers, developers, and researchers who found that the model regularly makes up text that speakers never said, a phenomenon often called a “confabulation” or “hallucination” in the AI world.
Upon release in 2022, OpenAI claimed that Whisper approached “human-level robustness” in audio transcription accuracy. However, a University of Michigan researcher told the AP that Whisper created false text in 80 percent of public meeting transcripts examined. Another developer, not named in the AP report, claimed to have found fabricated content in nearly all of their 26,000 test transcripts.
The fabrications pose special risks in healthcare. Despite OpenAI's warnings against using Whisper for “high-risk domains,” more than 30,000 medical providers now use Whisper-based tools to transcribe patient visits, the AP report said. The Mankato Clinic in Minnesota and Children's Hospital Los Angeles are among 40 healthcare systems using a Whisper-powered AI copilot service from medical technology company Nabla, which is tailored to medical terminology.
Nabla acknowledges that Whisper can confabulate, but it also reportedly deletes original audio recordings “for data security reasons.” That compounds the problem, because doctors cannot check the transcript against the source material. Deaf patients are especially exposed to harm from incorrect transcriptions, since they have no way of knowing whether a medical transcript matches what was actually said.
The potential problems with Whisper extend beyond healthcare. Researchers from Cornell University and the University of Virginia studied thousands of audio clips and found that Whisper added nonexistent violent content and racist commentary to neutral speech. They found that 1 percent of the samples contained “entire hallucinated phrases or sentences that did not appear in any form in the underlying audio” and that 38 percent of those included “explicit harms such as perpetuating violence, making up inaccurate associations, or implying false authority.”
In one instance from the study cited by the AP, when a speaker described “two other girls and one lady,” Whisper added fictional text specifying that they “were black.” In another, the audio said: “He, the boy, would, I'm not exactly sure, take the umbrella.” Whisper transcribed it as: “He took a big piece of a cross, a tiny piece… I'm sure he didn't have a terror knife, so he killed some people.”
A spokesperson for OpenAI told the AP that the company appreciates the researchers' findings, is actively studying how to reduce fabrications, and incorporates feedback into updates to the model.
Why Whisper Confabulates
The key to Whisper's unsuitability in high-risk domains is its tendency to sometimes confabulate, or plausibly make up, inaccurate outputs. The AP report says, “Researchers aren't sure why Whisper and similar tools hallucinate,” but that isn't quite true. We know exactly why Transformer-based AI models like Whisper behave the way they do.
Whisper is based on technology designed to predict the next most likely token (piece of data) that should appear after a series of tokens provided by a user. In the case of ChatGPT, the input tokens are in the form of a text prompt. In the case of Whisper, the input consists of tokenized audio data.
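To make that concrete, here is a minimal sketch of how a developer might drive Whisper through the open-source whisper Python package. The audio file path is a placeholder, and the decoding settings shown are common community adjustments aimed at curbing made-up text, not an official fix from OpenAI.

```python
# Minimal sketch using the open-source "whisper" package (pip install openai-whisper).
# The file path is a placeholder; decoding options are illustrative, not a cure.
import whisper

# Load a pretrained checkpoint (sizes range from "tiny" to "large").
model = whisper.load_model("base")

# Internally, the model encodes the audio, then a Transformer decoder predicts
# the most likely next text token one at a time -- the same next-token
# prediction that can produce fluent but fabricated sentences when the audio
# is ambiguous, noisy, or silent.
result = model.transcribe(
    "patient_visit.wav",              # placeholder path
    temperature=0.0,                  # greedy decoding; less random invention
    condition_on_previous_text=False, # don't let earlier (possibly wrong) text steer later guesses
    no_speech_threshold=0.6,          # skip segments the model judges to be silence
)

print(result["text"])
```

Settings like a zero temperature or disabling conditioning on previous text can reduce, but not eliminate, fabricated output, because the underlying mechanism is still probabilistic next-token prediction.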