
Don’t call your favorite AI “doctor” just yet
Just_Super/Getty Images
Advanced artificial intelligence models score well in professional medical examinations, but they still fall short at one of the most important tasks of a doctor: talking to patients to gather relevant medical information and deliver an accurate diagnosis.
“While large language models show impressive results in multiple-choice tests, their accuracy drops significantly in dynamic conversations,” says Pranav Rajpurkar at Harvard University. “The models particularly struggle with open-ended diagnostic reasoning.”
This became evident when researchers developed a method to assess a clinical AI model’s reasoning abilities through simulated doctor-patient conversations. The “patients” were based on 2000 medical cases, mainly drawn from US medical board examinations.
“Simulating patient interactions allows assessment of medical history-taking skills, a critical component of clinical practice that cannot be assessed using case vignettes,” says Shreya Johri, also at Harvard University. The new benchmark, called CRAFT-MD, “mirrors real-life scenarios where patients are unsure of which details are important to share and may disclose relevant information only when prompted by specific questions.”
The CRAFT-MD benchmark itself relies on AI. OpenAI’s GPT-4 model played the role of a “patient AI” in conversation with the “clinical AI” being tested. GPT-4 also helped to grade the results by comparing the clinical AI’s diagnosis with the correct answer for each case. These assessments were double-checked by human medical experts, who also reviewed the conversations to check the accuracy of the patient AI and to see whether the clinical AI succeeded in gathering the relevant medical information.
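The general shape of such an evaluation can be sketched in code. The snippet below is a minimal, hypothetical illustration of a conversational benchmark loop of this kind, not the published CRAFT-MD implementation: the prompts, the `chat` helper and the turn limit are all assumptions made for illustration.

```python
# Hypothetical sketch of a conversational evaluation loop in the style described
# in the article: a "clinical AI" interviews a GPT-4 "patient AI" seeded with one
# case, then a grader model checks the resulting diagnosis. All names, prompts
# and the chat() helper are illustrative assumptions, not the study's code.

def chat(model: str, messages: list[dict]) -> str:
    """Placeholder for a chat-completion API call (e.g. to GPT-4).
    Swap in your provider's client; it should return the reply text."""
    raise NotImplementedError

def run_case(case_vignette: str, clinical_model: str,
             patient_model: str = "gpt-4", max_turns: int = 10) -> str:
    """Let the clinical AI take a history from the patient AI and return
    its final free-text diagnosis."""
    patient_msgs = [{"role": "system", "content": (
        "You are a patient. Answer the doctor's questions using only the "
        f"details in this case, revealing them only when asked:\n{case_vignette}")}]
    doctor_msgs = [{"role": "system", "content": (
        "You are a doctor. Ask the patient one question at a time to take a "
        "history. When ready, reply with 'DIAGNOSIS: <your diagnosis>'.")}]

    for _ in range(max_turns):
        question = chat(clinical_model, doctor_msgs)
        doctor_msgs.append({"role": "assistant", "content": question})
        if question.startswith("DIAGNOSIS:"):
            return question.removeprefix("DIAGNOSIS:").strip()
        patient_msgs.append({"role": "user", "content": question})
        answer = chat(patient_model, patient_msgs)
        patient_msgs.append({"role": "assistant", "content": answer})
        doctor_msgs.append({"role": "user", "content": answer})
    return "no diagnosis reached"

def grade(diagnosis: str, ground_truth: str, grader_model: str = "gpt-4") -> bool:
    """Ask a grader model whether the diagnosis matches the reference answer;
    in the study, automated grades like this were double-checked by clinicians."""
    verdict = chat(grader_model, [{"role": "user", "content": (
        f"Does '{diagnosis}' match the diagnosis '{ground_truth}'? Answer yes or no.")}])
    return verdict.strip().lower().startswith("yes")
```

The key design point the article highlights is that the case details live only with the patient AI, so the clinical AI earns its diagnosis by asking the right questions rather than reading a prepared summary.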
Multiple experiments showed that the top four large language models tested (OpenAI’s GPT-3.5 and GPT-4 models, Meta’s Llama-2-7b model and Mistral AI’s Mistral-v2-7b model) performed significantly worse on the conversation-based benchmark than they did when making diagnoses from written case summaries. OpenAI, Meta and Mistral AI did not respond to requests for comment.
For example, GPT-4’s diagnostic accuracy was an impressive 82 per cent when it was presented with structured case summaries and allowed to select a diagnosis from a multiple-choice list, falling to just under 49 per cent without the multiple-choice options. When it had to make diagnoses from simulated patient interviews, however, its accuracy dropped to 26 per cent.
GPT-4 was still the best-performing AI model tested in the study, with GPT-3.5 often coming in second; the Mistral AI model sometimes ranked second or third, and Meta’s Llama model generally scored the lowest.
The AI models also failed to collect complete medical histories much of the time, with the leading GPT-4 model managing to do so in only 71 per cent of simulated patient interviews. Even when the models did collect a patient’s relevant medical history, they did not always produce the correct diagnoses.
Such simulated patient interviews are a “much more useful” way to assess AI clinical reasoning skills than medical exams, says Eric Topol at the Scripps Research Translational Institute in California.
Even if an AI model eventually beats this benchmark by consistently making accurate diagnoses from simulated patient interviews, that wouldn’t necessarily mean it outperforms human doctors, says Rajpurkar. He points out that medical practice in the real world is “messier” than in simulations: it involves managing multiple patients, coordinating with healthcare teams, performing physical examinations and understanding the “complex social and systemic factors” in local healthcare settings.
“Strong benchmark performance would suggest that AI could be a powerful tool for supporting clinical work, but not necessarily a replacement for the holistic judgement of experienced physicians,” says Rajpurkar.