Preparing cancer patients for difficult decisions is the job of an oncologist. They don’t always remember to do it, though. At the University of Pennsylvania Health System, doctors are encouraged to talk about a patient’s treatment and end-of-life preferences by an artificially intelligent algorithm that predicts the chances of death.
But it’s far from a set-it-and-forget-it tool. A routine tech checkup revealed that the algorithm’s performance decayed during the covid-19 pandemic, becoming 7 percentage points worse at predicting who would die, according to a 2022 study.
There were likely real-life impacts. Ravi Parikh, an Emory University oncologist who was the study’s lead author, told KFF Health News that the tool failed hundreds of times to prompt doctors to start this important discussion with patients who needed it, a conversation that can head off unnecessary chemotherapy.
He believes that many algorithms designed to improve medical care weakened during the pandemic, not just the one at Penn Medicine. “Many organizations don’t routinely monitor the performance of their products,” Parikh said.
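Routine monitoring of the sort Parikh describes doesn’t have to be elaborate: at its simplest, it means periodically rescoring the model on recent patients and comparing the result against its historical baseline. The sketch below is a minimal, hypothetical illustration in Python; the baseline figure, alert threshold, and function name are invented for the example and are not Penn Medicine’s actual pipeline.

```python
from sklearn.metrics import roc_auc_score

BASELINE_AUC = 0.78   # hypothetical pre-pandemic performance
ALERT_DROP = 0.05     # flag drops of more than 5 percentage points

def routine_checkup(y_true, y_score):
    """y_true: observed outcomes (1 = patient died within the window);
    y_score: the model's predicted risk for the same patients."""
    current_auc = roc_auc_score(y_true, y_score)
    drop = BASELINE_AUC - current_auc
    if drop > ALERT_DROP:
        print(f"ALERT: AUC fell from {BASELINE_AUC:.2f} to {current_auc:.2f} "
              f"({drop * 100:.0f} percentage points); the model may have decayed.")
    return current_auc
```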
Algorithmic glitches are one facet of a dilemma that computer scientists and doctors have long acknowledged but that hospital executives and researchers are only beginning to grapple with: AI systems require consistent monitoring and staffing to put in place and to keep working well.
Basically: you need people, and more machines, to make sure the new tools don’t get confused.
“Everybody thinks that AI is going to help us in our access and capacity and improve care and so on,” said Nigam Shah, chief data scientist at Stanford Health Care. “That’s all well and good, but if it increases the cost of care by 20%, is it viable?”
Government officials worry that hospitals lack the resources to put these technologies through their paces. “I’ve looked far and wide,” FDA Commissioner Robert Califf said at a recent agency panel on AI. “I don’t think there is a single health system in the United States that is capable of validating an AI algorithm that is implemented into a clinical care system.”
AI is already widespread in health care. Algorithms are used to predict patients’ risk of death or deterioration, suggest diagnoses, triage patients, record and summarize visits to save doctors work, and approve insurance claims.
If the technology evangelists are right, the technology will be ubiquitous and profitable. Investment firm Bessemer Venture Partners has identified some 20 health-focused AI startups on track to reach $10 million in annual revenue. The FDA has approved nearly a thousand artificially intelligent products.
Assessing whether these products work or not is difficult. Assessing whether they still work—or have developed the software equivalent of a blown gasket or leaky engine—is even more difficult.
Take a recent study at Yale Medicine that evaluated six “early warning systems,” which alert clinicians when patients are likely to rapidly deteriorate. A supercomputer ran the data over several days, said Dana Edelson, a doctor at the University of Chicago and co-founder of a company that provided one algorithm for the study. The process was fruitful, showing big differences in performance among the six products.
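At a high level, that kind of head-to-head evaluation means scoring every system’s predictions against the same observed outcomes and comparing the results. A simplified sketch follows; the metric choice and system names are placeholders, not the study’s actual protocol.

```python
from sklearn.metrics import roc_auc_score

def compare_warning_systems(outcomes, scores_by_system):
    """outcomes: observed deterioration events (0/1) for one patient cohort;
    scores_by_system: {"system_a": [risk scores...], ...} over the same patients."""
    results = {name: roc_auc_score(outcomes, scores)
               for name, scores in scores_by_system.items()}
    # Print the systems from best to worst discrimination.
    for name, auc in sorted(results.items(), key=lambda item: -item[1]):
        print(f"{name}: AUC = {auc:.3f}")
    return results
```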
It is not easy for hospitals and providers to select the best algorithms for their needs. The average doctor doesn’t have a supercomputer sitting around, and there’s no Consumer Reports for AI.
“We don’t have a standard,” said Jesse Ehrenfeld, president of the American Medical Association. “There’s nothing that I can point to today that is a standard for how you evaluate, monitor, how you see the performance of an algorithm, AI-enabled or not, when it’s deployed.”
Perhaps the most common AI product in doctors’ offices is ambient documentation, a technology-enabled assistant that listens to and summarizes patient visits. Last year, Rock Health tracked $353 million in investment in these documentation companies. But, Ehrenfeld said, “There is currently no standard against which to compare the output of these tools.”
And that’s a problem when even small mistakes can be devastating. A Stanford University team tried using large language models, the technology behind popular AI tools like ChatGPT, to summarize patients’ medical histories. They compared the results with what a physician would write.
“Even in the best case, the models had a 35% error rate,” said Stanford’s Shah. In medicine, “when you’re writing a summary and you forget a word, like ‘fever,’ I mean, that’s a problem, right?”
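One basic safeguard against the failure mode Shah describes is to check whether a generated summary silently drops clinically important terms that appear in the source note. The snippet below is an illustrative check only, not Stanford’s evaluation protocol, and the term list is a hypothetical stand-in.

```python
# Hypothetical list of terms a summary should never silently drop.
KEY_TERMS = {"fever", "chest pain", "allergy", "anticoagulant"}

def missing_key_terms(source_note: str, summary: str) -> set:
    """Return key clinical terms present in the note but absent from the summary."""
    note, summ = source_note.lower(), summary.lower()
    return {term for term in KEY_TERMS if term in note and term not in summ}

note = "Patient presents with fever and chest pain; on anticoagulant therapy."
summary = "Patient presents with chest pain; on anticoagulant therapy."
print(missing_key_terms(note, summary))   # {'fever'}
```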
Sometimes the reasons algorithms fail are fairly logical. For example, changes to the underlying data can erode a model’s effectiveness, as when a hospital switches lab providers.
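When the cause is a data shift, it can often be caught by comparing recent inputs with historical ones. A minimal sketch, assuming the lab switch shows up as a shifted distribution in one feature (the values and the unit change are invented for illustration):

```python
from scipy.stats import ks_2samp

def check_feature_drift(reference_values, current_values, alpha=0.01):
    """Two-sample Kolmogorov-Smirnov test between historical and recent lab values."""
    stat, p_value = ks_2samp(reference_values, current_values)
    drifted = p_value < alpha
    return drifted, stat, p_value

# Hypothetical example: the new lab reports creatinine in different units.
old_lab = [0.9, 1.1, 1.0, 1.3, 0.8, 1.2]
new_lab = [80, 97, 88, 115, 71, 106]
print(check_feature_drift(old_lab, new_lab))   # (True, 1.0, ~0.002)
```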
Sometimes, however, the pitfalls yawn open for no apparent reason.
Sandy Aronson, a technology executive at Mass General Brigham’s personalized medicine program in Boston, said that when the team tested an application meant to help genetic counselors find relevant literature on DNA variants, the product suffered from “nondeterminism”: asked the same question multiple times within a short period, it gave different results.
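Nondeterminism of that kind is straightforward to surface, even if its cause isn’t. A toy harness, assuming a hypothetical `ask(query)` wrapper around such an application (not the actual product’s API):

```python
def check_determinism(ask, query: str, trials: int = 5) -> bool:
    """Ask the same question several times in a row; report whether the answers agree."""
    answers = [str(ask(query)) for _ in range(trials)]
    distinct = set(answers)
    if len(distinct) > 1:
        print(f"Nondeterministic: {len(distinct)} distinct answers for {query!r}")
        return False
    return True
```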
Aronson is excited about the potential of large language models to summarize knowledge for overburdened genetic counselors, but “the technology needs to improve.”
If metrics and standards are sparse and errors can crop up for strange reasons, what should institutions do? Invest a lot of resources. At Stanford, Shah said, it took eight to 10 months and 115 person-hours just to audit two models for fairness and reliability.
Experts interviewed by KFF Health News floated the idea of artificial intelligence monitoring artificial intelligence, with some (human) data whiz monitoring both. All acknowledged that would require organizations to spend even more money, a tough ask given the realities of hospital budgets and the limited supply of AI technology specialists.
“It’s great to have a vision where we’re melting icebergs in order to have a model monitoring their model,” Shah said. “But is that really what I wanted? How many more people are we going to need?”
KFF Health News, formerly known as Kaiser Health News (KHN), is a national newsroom that produces in-depth journalism about health issues and is one of the core operating programs at KFF, an independent source for health policy research, polling, and journalism.