The patient was a 39-year-old woman who had come to the emergency department of Beth Israel Deaconess Medical Center in Boston. Her left knee had been hurting for days. The day before, she had had a fever of 102 degrees. It was gone now, but she still had chills. And her knee was red and swollen.
What was the diagnosis?
On a recent steamy Friday, Dr. Megan Landon, a medical resident, presented this real case to a roomful of medical students and residents. They had gathered to learn a skill that can be devilishly difficult to teach: how to think like a doctor.
“Doctors are terrible at teaching other doctors how we think,” said Dr. Adam Rodman, an internist, medical historian, and one of the organizers of the event at Beth Israel Deaconess.
But this time, they could turn to an expert for help with the diagnosis: GPT-4, the latest version of a chatbot released by OpenAI.
Artificial intelligence is transforming many aspects of medical practice, and some medical professionals are using these tools to help them make a diagnosis. Doctors at Beth Israel Deaconess, a teaching hospital affiliated with Harvard Medical School, decided to investigate how chatbots could be used — and abused — in educating future doctors.
Instructors like Dr. Rodman hope medical students can turn to GPT-4 and other chatbots for something akin to what doctors call a curbside consult: when they pull a colleague aside and ask for an opinion on a difficult case. The idea is to use a chatbot the same way doctors turn to one another for suggestions and insights.
For more than a century, doctors have been portrayed as detectives who collect clues and use them to find the culprit. But experienced doctors actually use a different method – pattern recognition – to figure out what’s wrong. In medicine, it’s called an illness script: signs, symptoms, and test results that doctors piece together to tell a cohesive story based on similar cases they know of or have seen themselves.
If the illness script doesn’t help, Dr. Rodman said, doctors turn to other strategies, such as assigning probabilities to different diagnoses that might fit.
For more than half a century, researchers have been trying to design computer programs to make medical diagnoses, but nothing has really worked out.
Doctors say GPT-4 is different. “It will create something remarkably similar to an illness script,” said Dr. Rodman. In that way, he added, “it’s fundamentally different from a search engine.”
Dr. Rodman and other doctors at Beth Israel Deaconess asked GPT-4 about possible diagnoses in difficult cases. In a study published last month in the medical journal JAMA, they found it outperformed most doctors on weekly diagnostic challenges published in The New England Journal of Medicine.
But, they learned, there is an art to using the program and there are pitfalls.
Dr. Christopher Smith, the director of the medical center’s internal medicine residency program, said medical students and residents “definitely use it.” But, he added, “whether they learn anything is an open question.”
The concern is that they could rely on AI to make diagnoses just as they would rely on a calculator on their phone to solve a math problem. That, Dr. Smith said, is dangerous.
Learning, he said, involves trying to figure things out: “That’s how we retain things. Part of learning is the struggle. If you outsource the learning to GPT, that struggle is gone.”
During the meeting, students and residents split into groups and tried to figure out what was wrong with the patient with the swollen knee. Then they switched to GPT-4.
The groups tried different approaches.
One group used GPT-4 to do an internet search, similar to the way they would use Google. The chatbot spat out a list of possible diagnoses, including trauma. But when the group members asked it to explain its reasoning, the bot was disappointing, justifying its choice by stating, “Trauma is a common cause of knee injuries.”
Another group came up with possible hypotheses and asked GPT-4 to check them. The chatbot’s list matched the group’s: infections, including Lyme disease; arthritis, including gout, a form of arthritis involving crystals in the joints; and trauma.
GPT-4 added rheumatoid arthritis to the top possibilities, although it was not high on the group’s list. Gout, instructors later told the group, was unlikely for this patient because she was young and female. And rheumatoid arthritis could probably be ruled out because only one joint was inflamed, and only for a few days.
GPT-4 seemed to pass the test, or at least agree with the students and residents. But in this exercise, it offered no insights and no illness script.
One reason may be that the students and residents used the bot more like a search engine than a curbside consult.
To use the bot correctly, the instructors said, they would have to start by telling GPT-4 something like, “You’re a doctor seeing a 39-year-old woman with knee pain.” They would then have to list her symptoms before asking for a diagnosis and questioning the bot’s reasoning, as they would with a medical colleague.
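For readers curious what that kind of role-setting prompt looks like in code, here is a minimal sketch using the openai Python package. The model name, prompt wording, and follow-up question are illustrative assumptions, not the instructors’ actual script.

```python
# A minimal sketch of the prompting pattern the instructors describe:
# set the role first, then list the symptoms, then press for reasoning.
# Assumes the openai Python package (v1+) and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

messages = [
    # Role-setting message, akin to "You're a doctor seeing this patient."
    {"role": "system",
     "content": "You are a physician evaluating a 39-year-old woman with knee pain."},
    # The case details, listed before asking for a diagnosis.
    {"role": "user",
     "content": ("Her left knee has hurt for several days. She had a fever of 102 "
                 "degrees yesterday; it is gone, but she still has chills, and the "
                 "knee is red and swollen. What is your differential diagnosis, "
                 "and what reasoning supports each possibility?")},
]

response = client.chat.completions.create(model="gpt-4", messages=messages)
print(response.choices[0].message.content)

# Follow-up turns would question the bot's reasoning, as one would with a colleague, e.g.:
# messages.append({"role": "assistant", "content": response.choices[0].message.content})
# messages.append({"role": "user", "content": "Why do you rank that diagnosis first?"})
```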
That, the instructors said, is one way to harness the power of GPT-4. But it’s also crucial to recognize that chatbots can make mistakes and “hallucinate” – giving answers without any factual basis. To use them, you need to know when they are wrong.
“It’s not wrong to use these tools,” said Dr. Byron Crowe, an internist at the hospital. “You just have to use them the right way.”
He gave the group an analogy.
“Pilots use GPS,” said Dr. Crowe. But, he added, airlines “have a very high standard of reliability.” In medicine, he said, using chatbots is “very tempting,” but the same high standards should apply.
“It’s a great thinking partner, but it doesn’t replace deep mental expertise,” he said.
When the session ended, the instructors revealed the true reason for the patient’s swollen knee.
It turned out to be a possibility that every group had considered and that GPT-4 had also suggested.
She had Lyme disease.
Olivia Allison contributed reporting.