
Don’t call your favorite AI “doctor” just yet
Just_Super/Getty Images
Advanced artificial intelligence models score well in professional medical examinations, but they still fall short at one of the most important tasks of a doctor: talking to patients to gather relevant medical information and deliver an accurate diagnosis.
“While large language models show impressive results in multiple-choice tests, their accuracy drops significantly in dynamic conversations,” says Pranav Rajpurkar at Harvard University. “The models particularly struggle with open-ended diagnostic reasoning.”
This became evident when researchers developed a method to assess a clinical AI model’s reasoning abilities through simulated doctor-patient conversations. The “patients” were based on 2000 medical cases, mainly drawn from US medical board examinations.
“Simulating patient interactions allows assessment of medical history-taking skills, a critical component of clinical practice that cannot be assessed using case vignettes,” says Shreya Johri, also at Harvard University. The new benchmark, called CRAFT-MD, “mirrors real-life scenarios where patients are unsure of which details are important to share and may disclose relevant information only when prompted by specific questions.”
The CRAFT-MD benchmark itself relies on AI. OpenAI’s GPT-4 model played the role of a “patient AI” in conversation with the “clinical AI” being tested. GPT-4 also helped to grade the results by comparing the clinical AI’s diagnosis with the correct answer for each case. These assessments were double-checked by human medical experts, who also reviewed the conversations to check the accuracy of the patient AI and to see whether the clinical AI succeeded in gathering the relevant medical information.
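The general shape of such an evaluation can be sketched in code. The snippet below is a minimal, hypothetical illustration of a conversational benchmark loop of this kind, not the published CRAFT-MD implementation: the prompts, the `chat` helper and the turn limit are all assumptions made for illustration.

```python
# Hypothetical sketch of a conversational evaluation loop in the style described
# in the article: a "clinical AI" interviews a GPT-4 "patient AI" seeded with one
# case, then a grader model checks the resulting diagnosis. All names, prompts
# and the chat() helper are illustrative assumptions, not the study's code.

def chat(model: str, messages: list[dict]) -> str:
    """Placeholder for a chat-completion API call (e.g. to GPT-4).
    Swap in your provider's client; it should return the reply text."""
    raise NotImplementedError

def run_case(case_vignette: str, clinical_model: str,
             patient_model: str = "gpt-4", max_turns: int = 10) -> str:
    """Let the clinical AI take a history from the patient AI and return
    its final free-text diagnosis."""
    patient_msgs = [{"role": "system", "content": (
        "You are a patient. Answer the doctor's questions using only the "
        f"details in this case, revealing them only when asked:\n{case_vignette}")}]
    doctor_msgs = [{"role": "system", "content": (
        "You are a doctor. Ask the patient one question at a time to take a "
        "history. When ready, reply with 'DIAGNOSIS: <your diagnosis>'.")}]

    for _ in range(max_turns):
        question = chat(clinical_model, doctor_msgs)
        doctor_msgs.append({"role": "assistant", "content": question})
        if question.startswith("DIAGNOSIS:"):
            return question.removeprefix("DIAGNOSIS:").strip()
        patient_msgs.append({"role": "user", "content": question})
        answer = chat(patient_model, patient_msgs)
        patient_msgs.append({"role": "assistant", "content": answer})
        doctor_msgs.append({"role": "user", "content": answer})
    return "no diagnosis reached"

def grade(diagnosis: str, ground_truth: str, grader_model: str = "gpt-4") -> bool:
    """Ask a grader model whether the diagnosis matches the reference answer;
    in the study, automated grades like this were double-checked by clinicians."""
    verdict = chat(grader_model, [{"role": "user", "content": (
        f"Does '{diagnosis}' match the diagnosis '{ground_truth}'? Answer yes or no.")}])
    return verdict.strip().lower().startswith("yes")
```

The key design point the article highlights is that the case details live only with the patient AI, so the clinical AI earns its diagnosis by asking the right questions rather than reading a prepared summary.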
Multiple experiments showed that the top four large language models tested (OpenAI’s GPT-3.5 and GPT-4 models, Meta’s Llama-2-7b model and Mistral AI’s Mistral-v2-7b model) performed significantly worse on the conversation-based benchmark than they did when making diagnoses from written case summaries. OpenAI, Meta and Mistral AI did not respond to requests for comment.
For example, GPT-4’s diagnostic accuracy was an impressive 82 per cent when it was presented with structured case summaries and allowed to select a diagnosis from a multiple-choice list, falling to just under 49 per cent without the multiple-choice options. When it had to make diagnoses from simulated patient interviews, however, its accuracy dropped to 26 per cent.
GPT-4 was still the best-performing AI model tested in the study, with GPT-3.5 often coming in second; the Mistral AI model sometimes ranked second or third, and Meta’s Llama model generally scored the lowest.
The AI models also failed to collect complete medical histories much of the time, with the leading GPT-4 model managing to do so in only 71 per cent of simulated patient interviews. Even when the models did collect a patient’s relevant medical history, they did not always produce the correct diagnoses.
Such simulated patient interviews are a “much more useful” way to assess AI clinical reasoning skills than medical exams, says Eric Topol at the Scripps Research Translational Institute in California.
Even if an AI model eventually beats this benchmark by consistently making accurate diagnoses from simulated patient interviews, that wouldn’t necessarily mean it outperforms human doctors, says Rajpurkar. He points out that medical practice in the real world is “messier” than in simulations: it involves managing multiple patients, coordinating with healthcare teams, performing physical examinations and understanding the “complex social and systemic factors” in local healthcare settings.
“Strong benchmark performance would suggest that AI could be a powerful tool for supporting clinical work, but not necessarily a replacement for the holistic judgement of experienced physicians,” says Rajpurkar.