AI’s Errors May Be Impossible to Eliminate – What That Means for Its Use in Health Care

Yves here. Yours truly is very much opposed to using AI in medicine, as in my care, and thus am delighted to be in a country where AI uptake is slow. IM Doc’s horror stories about its rampant screw-ups in the simple task of taking notes on patient sessions are enough to give one pause. And the inaccurate record is still in doctorese, so it’s not as if anyone but the initial MD could catch what was wrong.

I suspect I am not alone in suffering from sub-clinical conditions, which means what data there is on them tends to relegate them to not being worth bothering about. But those can and do add up. I pointed out to the top surgeon who did my hips that my orthopedic anomalies, while individually not looking serious, collectively put me way out of band. He sat bolt upright and agreed emphatically, as if he had yet to put together so tersely what was up with my structure.

I similarly have had some long-standing endocrine test results that are just not supposed to happen, and also don’t get pain relief from opiates. So I don’t trust AI to deal with what searches have confirmed as peculiarities. I have a sneaking suspicion that a decent proportion of the population (anywhere from 5% to as high as 15%) is oddball enough in ways that matter so as to not be well suited to the tender ministrations of AI care.

This article discusses another big problem with AI treatments: that there will always be errors due to what the author describes as categorization issues. Many ailments have large overlaps in symptoms, like scleroderma versus dermatomyositis (both very nasty skin-afflicting auto-immune diseases). I am sure many here, both as patients and practitioners, can list cases of mis-diagnosis where the error and/or delay made a difference in outcomes. A doctor in the old world might be willing to listen to a patient when the initial “no real problem” or incorrect reading increasingly looked wrong (as in the patient got worse).

The article tries to propose a “best of all possible worlds” scenario where a doctor gets AI recommendations and then reviews them. That is likely to wind up being the worst of all possible worlds. The Lancet reported that doctors who used AI assistance in colonoscopies became worse at identifying potentially dangerous polyps on their own. More generally, studies are finding that regular use of ChatGPT, such as for writing essays, results in changes in brain activity as measured in EEGs. For instance, one found that “Brain connectivity systematically scaled down with the amount of external support.”

By Carlos Gershenson, Professor of Innovation, Binghamton University, State University of New York. Originally published at The Conversation

In the past decade, AI’s success has led to uncurbed enthusiasm and bold claims – even though users frequently experience errors that AI makes. An AI-powered digital assistant can misunderstand someone’s speech in embarrassing ways, a chatbot could hallucinate facts, or, as I experienced, an AI-based navigation tool might even guide drivers through a corn field – all without registering the errors.

People tolerate these mistakes because the technology makes certain tasks more efficient. Increasingly, however, proponents are advocating the use of AI – sometimes with limited human supervision – in fields where mistakes have high costs, such as health care. For example, a bill introduced in the U.S. House of Representatives in early 2025 would allow AI systems to prescribe medications autonomously. Health researchers and lawmakers have since debated whether such prescribing would be feasible or advisable.

How exactly such prescribing would work if this or similar legislation passes remains to be seen. But it raises the stakes for how many errors AI developers can allow their tools to make and what the consequences would be if those tools led to negative outcomes – even patient deaths.

As a researcher studying complex systems, I investigate how different components of a system interact to produce unpredictable outcomes. Part of my work focuses on exploring the limits of science – and, more specifically, of AI.

Over the past 25 years I have worked on projects including traffic light coordination, bureaucracy improvement and tax evasion detection. Even when these systems are highly effective, they are never perfect.

For AI in particular, errors might be an inescapable consequence of how the systems work. My lab’s research suggests that particular properties of the data used to train AI models play a role. This is unlikely to change, regardless of how much time, effort and funding researchers direct at improving AI models.

Nobody – and Nothing, Not Even AI – Is Perfect

As Alan Turing, considered the father of computer science, once said: “If a machine is expected to be infallible, it cannot also be intelligent.” This is because learning is an essential part of intelligence, and people usually learn from mistakes. I see this tug-of-war between intelligence and infallibility at play in my research.

In a study published in July 2025, my colleagues and I showed that perfectly organizing certain datasets into clear categories may be impossible. In other words, there may be a minimum number of errors that a given dataset produces, simply because elements of different categories overlap. For some datasets – the core underpinning of many AI systems – AI will not perform better than chance.

For example, a model trained on a dataset of millions of dogs that logs only their age, weight and height will probably distinguish Chihuahuas from Great Danes with perfect accuracy. But it may make mistakes in telling apart an Alaskan malamute and a Doberman pinscher, since individuals of those different breeds might fall within the same age, weight and height ranges.
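To make this concrete, here is a minimal sketch in Python with invented breed statistics (not real dog data, and not any model from the study). When two groups overlap on the recorded features, a classifier's accuracy plateaus well below 100% no matter how much training data it gets:

```python
# Illustrative only: hypothetical breed statistics, not real dog data.
# Two "breeds" whose weight/height distributions overlap. Accuracy plateaus at the
# level set by the overlap itself (the irreducible, or Bayes, error), regardless of
# how much data the classifier is trained on.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

def simulate_dogs(n_per_breed):
    # Breed A tends to be heavier and taller than breed B, but the ranges overlap.
    breed_a = rng.normal(loc=[32.0, 62.0], scale=[6.0, 7.0], size=(n_per_breed, 2))
    breed_b = rng.normal(loc=[26.0, 55.0], scale=[6.0, 7.0], size=(n_per_breed, 2))
    X = np.vstack([breed_a, breed_b])            # columns: weight (kg), height (cm)
    y = np.array([0] * n_per_breed + [1] * n_per_breed)
    return X, y

for n in (1_000, 10_000, 100_000):
    X, y = simulate_dogs(n)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    accuracy = LogisticRegression().fit(X_train, y_train).score(X_test, y_test)
    print(f"{n:>7} dogs per breed -> test accuracy {accuracy:.3f}")

# Accuracy stays roughly the same at every dataset size: the overlapping examples
# cannot be separated by any algorithm, however sophisticated.
```

In this toy setup the ceiling comes from the data itself, not from the choice of algorithm, which is the classifiability limit described below.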

This property – how cleanly a dataset can be divided into categories – is called classifiability, and my students and I started studying it in 2021. Using data from more than half a million students who attended the Universidad Nacional Autónoma de México between 2008 and 2020, we wanted to solve a seemingly simple problem. Could we use an AI algorithm to predict which students would finish their university degrees on time – that is, within three, four or five years of starting their studies, depending on the major?

We tested several popular algorithms that are used for classification in AI and also developed our own. No algorithm was perfect; the best ones – even one we developed specifically for this task – achieved an accuracy rate of about 80%, meaning that at least 1 in 5 students were misclassified. We realized that many students were identical in terms of grades, age, gender, socioeconomic status and other features – yet some would finish on time, and some would not. Under these circumstances, no algorithm would be able to make perfect predictions.
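The effect is easy to reproduce with a toy example (invented records, not the UNAM data): once identical feature combinations map to different outcomes, the best any model can do is predict the majority outcome for each combination, which caps accuracy below 100%.

```python
# Toy illustration with invented records, not the UNAM dataset.
# When students with identical recorded features have different outcomes, no
# classifier can be perfect; the ceiling is reached by predicting each group's
# majority outcome.
from collections import Counter

# (grades band, socioeconomic band, age at entry) -> observed outcome
records = [
    (("high grades", "middle income", 18), "on time"),
    (("high grades", "middle income", 18), "late"),     # same features, different outcome
    (("high grades", "middle income", 18), "on time"),
    (("low grades", "low income", 19), "late"),
    (("low grades", "low income", 19), "on time"),
    (("low grades", "low income", 19), "late"),
]

outcomes_by_features = {}
for features, outcome in records:
    outcomes_by_features.setdefault(features, []).append(outcome)

# Best possible accuracy: always predict the majority outcome within each group.
best_correct = sum(max(Counter(group).values()) for group in outcomes_by_features.values())
print(f"accuracy ceiling: {best_correct}/{len(records)} = {best_correct / len(records):.0%}")
```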

You might think that more data would improve predictability, but this usually comes with diminishing returns. This means that, for example, for each increase in accuracy of 1%, you might need 100 times the data. Thus, we would never have enough students to significantly improve our model’s performance.

Additionally, many unpredictable turns in the lives of students and their families – unemployment, death, pregnancy – might occur after their first year at university, likely affecting whether they finish on time. So even with an infinite number of students, our predictions would still contain errors.

The Limits of Prediction

To put it more generally, what limits prediction is complexity. The word complexity comes from the Latin plexus, which means intertwined. The components that make up a complex system are intertwined, and it’s the interactions between them that determine what happens to them and how they behave.

Thus, studying elements of the system in isolation would probably yield misleading insights about them – as well as about the system as a whole.

Take, for example, a car traveling in a city. Knowing the speed at which it drives, it’s theoretically possible to predict where it will end up at a particular time. But in real traffic, its speed will depend on interactions with other vehicles on the road. Since the details of these interactions emerge in the moment and cannot be known in advance, precisely predicting what happens to the car is possible only a few minutes into the future.


Not With My Health

These same principles apply to prescribing medications. Different conditions and diseases can have the same symptoms, and people with the same condition or disease may exhibit different symptoms. For example, fever can be caused by a respiratory illness or a digestive one. And a cold might cause a cough, but not always.

This means that health care datasets have significant overlaps that would prevent AI from being error-free.

Certainly, humans also make errors. But when AI misdiagnoses a patient, as it surely will, the situation falls into a legal limbo. It’s not clear who or what would be responsible if a patient were hurt. Pharmaceutical companies? Software developers? Insurance agencies? Pharmacies?

In many contexts, neither humans nor machines are the best option for a given task. “Centaurs,” or “hybrid intelligence” – that is, a combination of humans and machines – tend to be better than either on its own. A doctor could certainly use AI to identify potential drugs for different patients, depending on their medical history, physiological details and genetic makeup. Researchers are already exploring this approach in precision medicine.

But common sense and the precautionary principle suggest that it is too early for AI to prescribe drugs without human oversight. And the fact that mistakes may be baked into the technology could mean that where human health is at stake, human supervision will always be necessary.


25 comments

  1. Andii

    The symptom overlap issue puts me in mind of that TV series ‘House’. The whole plot of each episode was precisely the difficulty of diagnosis because of the multivalency of symptoms and the difficulties of knowing what to attend to and judge salient and what might be misleading. AI misdiagnosis would be a case of anti-House.

  2. fjallstrom

    I have a sneaking suspicion that a decent proportion of the population (anywhere from 5% to as high as 15%) is oddball

    Not a bad estimate at all. A study a decade or so ago landed on 7% of people having a rare disease, which is defined as one affecting 1 in 2,000 people or fewer (0.05% of the population). There is a long tail of outliers in the human population.

    More to the topic, I had an opportunity to discuss accuracy and compliance the other day with a person involved in this type of medical AI. At least according to that person, the standard they are held to in the EU is that it mustn’t be worse than the state of the art. So, for example, an AI summing up a patient’s history mustn’t be worse than what a doctor (presumably a statistically average doctor) does on their own. We unfortunately did not have time to go into the details of how state of the art is evaluated. Compliance regulation does not seem to take into account whether the doctors get better or worse in the long term.

    1. vao

      AI systems are classifying algorithms (as explained in the article), or predictive Markov chains (like the over-hyped LLM tools), or maximum likelihood estimators — basically, fancy statistical functions with the distinctive characteristic of their relying upon billions of parameters.

      Classical statistical techniques deliver answers (whether extrapolations, estimations, classifications) that are accompanied by confidence intervals, levels of significance, goodness of fit values — in other words: quantitative evaluations of the precision, certainty, or quality of the computed result, all based on solid theoretical foundations. Yes, they are difficult to master with all those parametric and non-parametric models and all the underlying mathematical assumptions determining their validity or applicability in various situations — that’s why you need statisticians.

      However, I wonder whether anything like that is in place for AI tools. Those ECG assessments mentioned by Sam below: do they come with confidence intervals? Those tools providing diagnostics, do they come with goodness of fit values? Is any information provided so that the end-user understands whether the AI output is just a wild guess or a near-certainty?

      I fear that since current AI is grounded on big data to build large models with billions / tens of billions / hundreds of billions of parameters, defining the equivalent of confidence levels will remain mathematically intractable.

      Which means that if we do not want to fly blind, trusting AI results without any idea about their accuracy, we need a different kind of AI — which is what Gary Marcus and Ed Zitron have been advocating for a while.

  3. Sam

    My company’s software has been using algorithms (AI and otherwise) for decades to produce ECG diagnoses, with seemingly no legal or patient safety issues. We’re also trying to do it for retinal diagnoses.

    What Yves may not know is how much AI’s usefulness differs with each specific type of exam, and between diagnosis and prescription as the output.

    1. Yves Smith Post author

      IM Doc has been reporting AI diagnosis horrors over a wide range of conditions from his hospital and extended circle of former students. He may weigh in to differ. A recent, albeit more general, report:

      About AI. Several of my patients in the tech squillionaire world have access to different AIs than everyone else. Much more “advanced” – whatever that means. They now routinely bring me in 7-10 page AI documents to go over – they have placed all their symptoms and findings and downloaded their Garmin watches etc. Even in these advanced AI systems, there are massive mistakes, there is going down dark tunnels, there are dozens of ridiculous tests demanded, and just at times horrifying suggestions regarding medical therapy.

      This particular report is admittedly broad rather than narrow gauge. But one would think the analysis builds up from the readings of the individual test results and then the relationships among them.

      1. Huey

        I’m not sure if this specifically is what Sam’s referring to but ECG machines often give a printout with their own diagnosis based on the result. I’m assuming it must be based on pattern-recognition, similar to current “AI”.

        That said, and again I’m not sure if this is the specific product, but on the machines I’ve seen and used, the diagnoses are wrong often enough that we chastise any doc/medical student who tries to use it instead of interpreting the ECG themselves.

      2. IM Doc

        I am having a very difficult time anymore knowing exactly what all these new tech innovations actually are. They are being referred to collectively as “AI”; however, many of these processes do not actually seem to be AI – as in coming up with things on their own.

        There is only one thing that they have come up with, and only one, that seems to have been of any benefit at all. And I am not even sure it is “AI”. The modern EMR medical chart is an accumulation of all kinds of labs, notes, consult notes, referrals, insurance denials, etc. In every electronic system I have ever used, it is a complete and total mess, unlike the old paper charts, which were the easiest things ever in comparison. Many of these items are data arrays. Many of them are scanned fax documents. Many of them are local documents to the EMR. All have to be handled in different ways, usually involving 10-20 clicks per document. Profoundly tedious. There are now new systems that take queries, go mining into all this data in all its forms, and are able to produce answers in seconds in what used to take me hours. But in my opinion, that is not really AI as I understand it, although they call it that. It is like a Google system designed for medical charts.

        Again, that is the only thing that has been at least remotely helpful and not scary.

        The note taking system is so unstable and unreliable that it seems I spend more time correcting mistakes. These are big and massive mistakes – life-altering if left in the chart. How many doctors are going through this with a fine-tooth comb like I am?

        The ECG example you state above is interesting. For decades, the ECG machines have had simple pattern-recognition software that lists off likely diagnoses. I would say these older systems are wrong about half the time. I insist that it is turned off on any ECG of mine and certainly my students’. Interestingly, we now have an AI-assisted ECG reading system, and even more interestingly it is wrong about 70% of the time, often wildly wrong. Essentially, it is worse than the 1990s tech. It is just frightening to me how many times a week I am called by a specialist colleague because some dreadful thing has been put on the chart ECG reading by an AI that is not even close to reality. I shudder to think what happens if these things actually get acted upon.

        The “AI” diagnosis situation is even more scary. It pulls in all kinds of disparate things from the chart – many times, this is completely inaccurate info because the AI note-taking systems are so horrible and the docs just simply do not proofread them. I am just amazed at times what the suggestions are as far as testing and treatment. They are often completely divorced from any kind of reality.

        As an experiment the other day, I had the hospital admin and lots of colleagues in a conference room. We opened up a fake patient, and I read off the symptoms and lab values: a 72-year-old with severe acute low back pain, fever, profound acute fatigue, and a sudden change in urination frequency. The labs were showing an acute normocytic anemia with an HGB of 11, and acute renal failure with a creat of 2.9. This is a lot of disparate symptoms, but it is also a combination of things that a trained internist would know instantly. I have had board questions repeatedly in my life on certification exams with this exact scenario. You just have to know this stuff. There are literally thousands of these patterns in internal medicine. This is why it is a years-long training process. The AI had as its first diagnoses: acute sepsis syndrome, acute malaria (wow), acute ehrlichiosis (huh?), lupus (again, huh?), dermatomyositis, and acute glomerulonephritis. All I can say is some of these are entirely head scratchers. Some of them are appropriate. But the list of tests and labs that the patient was asked to get by the AI was legion – and would have cost the Medicare system about 70K. In no way, shape or form did the system even think of the correct diagnosis – the pattern diagnosis, which was multiple myeloma (back pain, fever, anemia and renal failure – a known and well-taught pattern) – or the simple 100-dollar test, a serum protein electrophoresis, that my internist self would have ordered before anything else. The admin were whomperjawed. This is profoundly scary. This stuff is not ready for prime time in any way, shape or form. But it is being used that way – by interns and residents in training, all the way to NPs who do not have a medicine and diagnostic background, to lazy or overworked MDs.

        Even more scary are all the patients now bringing in documentation from ChatGPT – and despite my begging them to reconsider, they go with the computer, often to very detrimental medical or financial results.

        I cannot fix any of this – only attempt to mitigate the damage in my realm.

        1. KLG

          Multiple myeloma. Yes. I am just a lowly molecular cell biologist who tutors first- and second-year medical students in hematology, including hematopoietic cancers. A preclinical medical student who did not get the correct answer immediately based on the presentation (age plus symptoms) would be a failing medical student with uncertain prospects and one who is likely to be a menace to the profession and patients.

          Add AI to the list of medical students in dire need of “remediation.”

        2. Yves Smith Post author

          Wow. Not only are the malaria, acute ehrlichiosis and lupus nuts to a dull normal like me, so too is dermatomyositis. That is what my father died from. It’s rare (3-10 per million) and none of the symptoms listed match.

    2. Skip Intro

      The machine learning algorithms used to match patterns in images are fundamentally different from the generative AI models discussed here. These new models build on the pattern recognition that is so successful, and use it to fabricate new data that statistically matches those patterns. What we currently consider AI, and the topic of this piece, is fundamentally limited by the fact that it only fabricates statistically plausible ‘Artificial Information’.
      More frightening than the errors is the potential for the AI to be gamed by training, so even the ‘right’ answers are biased. If an insurer can filter training data on cancer survival under various treatments, they can influence the model to prefer cheaper treatments, for example.

  4. Mikerw0

    AI in diagnosis will be a thing, and the reason is that it creates yet another systematic method to deny patients insured care that they are contracted in the policy to receive. And, once the people you would speak to and the doctors in the ‘health insurers’ who are supposed to review this stuff have been replaced, it will purposefully clog up the system. (Remember the reports from within Aetna, I think, about doctors reviewing thousands of issues in a spreadsheet in like an hour. Sorry, I remember the big picture, not all the details.)

    1. hickory

      Reminds me of the robosigning scandal, with mortgage specialists supposedly reviewing hundreds or thousands of cases per day and supposedly making informed judgments about them.

      1. cfraenkel

        Y, but even more so, reminds me of the shrinkwrap “contracts” we’ve gradually become subject to. At least originally, they started out as physical documents attached to products, and in theory customers could have read them. Now they’re magically invoked in a ‘privacy policy’ or equivalent every time you open a web page. With the pretense that ‘by clicking on accept you have read and agreed to every clause in this 30 page contract no human has ever read.’

        So much for legal theory when corporate profits are at stake.

    2. ChrisFromGA

      One problem with that is, regardless of the tool a doctor uses for diagnosis, they must sign off on it, and their career is on the line if AI errors lead to a misdiagnosis. Similar to how a lawyer can use AI to come up with fake case citations, but ultimately it is their professional responsibility to have done their best effort at research, and Rule 11 sanctions can be applied if the judge finds AI slop in their briefs/motions.

      Using the excuse “AI ate my homework” won’t cut it in a courtroom, especially in the worst-case scenario of a malpractice suit. And forget about counter-suing Google or OpenAI, they’ll no doubt put legalese in the ToS for any medical AI tools to absolve themselves of responsibility.

      The point about insurance companies is another matter. I definitely can see them using AI to “robo-deny” claims.

    3. samm

      And on the hospital side AI will also be a thing. It can be used for coding, generating ICD-10 codes which translate into billable units. Is it really upcoding if an AI hallucinates it?

  5. hickory

    It’s even worse than this article suggests. The author treats the doctor as a drug dispensing mechanism, as if drugs are the only thing to be offered as treatment. And it’s as if the doctor’s only job is to drill through the categories of symptoms and ailments to identify a disease and treatment plan.

    But great doctors do so much more – they listen to the patient, ask probing questions, get suspicious when things don’t add up (either in the patient’s story or the outcome of tests), and educate the patient. They make suggestions about exercise, diet, sleep, and other foundational quality-of-life elements that can dramatically improve or worsen the course of a disease. They notice if patients actually take the medicines/follow suggestions. They go to bat against insurance agencies that get stingy to ensure the patient gets the needed tests and care.

    There’s so much more that caring humans do than just parse a set of symptoms and identify drugs, and I don’t see AI doing these things, especially after it’s been widely adopted and there’s less risk of people switching over to human doctors. We’re not even in the enshittification phase of AI ‘doctors’ yet – just think how great that’ll be!

    1. 123abceng

      If a doctor talks to you, it doesn’t mean he is doing something specific about it. “Great doctors” do not exist anymore. A patient might have the illusion that he was educated, heard, etc. In reality, biology is too complex, and medicine is not biology-aware. Some doctors trained in the 1920s-50s… maybe. But we should stop fooling ourselves into thinking that some MDs of the 2020s “are better”. Even within biology itself, ever since 1986, when David Baltimore and his affiliates were caught and surprisingly won the case, we have known that biology lies on a massive scale.

  6. stefan

    My son was treated for cancer as a teenager. Afterwards, while finishing college, he was a lab intern at NHGRI (National Human Genome Research Institute) where he learned how to use the microarray. (He met his future wife, who was visiting from Sweden for her PhD, while showing her the ropes on the microarray.) Eventually he got an MD/PhD in cancer genetics and moved to Sweden where he invented and perfected a highly precise blood test for monitoring ctDNA (circulating tumor DNA, which certain tumors like breast cancer or lung cancer produce). Now he is working with some Norwegians to train AI to evaluate standard blue-stain histology slides for digital biomarkers in colorectal cancer. He is also co-director of a new national Swedish initiative to improve cancer treatment in women’s health.

    The future issue that I see is that increasing dependence on AI will dull the acuity of health practitioners. Medical caregiving is as much an art as it is a science.

  7. flora

    One wonders what the malpractice insurance companies will do?

    Follow AI results or else? Do not blindly follow AI or else? And what if the hospital (now possibly PE owned) or your group practice insists on following AI results in treatment?

    Older doctors with many years of practice might spot where AI has gone wrong. Younger, newly minted docs who were trained in teach-to-the-test from grade school on might lack enough experience and/or a willingness to go against the AI generated “correct answer.”

    Or, speculating freely, what if AI code includes biases about who should be treated for what? Imagine AI in Germany being coded to check the age of the ill person and saying for people over 65 “tell them to take 2 aspirin and send them home.”

    / my 2 cents.

  8. flora

    Anecdote: I remember 15 years ago or so some CS/IT profs from another uni began working with my uni’s teaching hospital, specifically with the teaching hospital’s nursing department PhD educators. The idea was to find a way for nurses seeing patients to capture what patients told the nurses about symptoms, lifestyle changes, etc. and record those things electronically in a computer program. The nurses’ notes would then be accessible to doctors via the computer program.

    The idea behind this project was that patients often would tell nurses significant things that patients might not tell doctors. Nurses were seen as sympathetic individuals, not as intimidating “authority figures.”

    Well, that sounded like a very reasonable idea. A good idea.

    To think that such a reasonable-sounding idea has developed or morphed into this….

  9. Mildred Montana

    I do believe that AI will turn out to have some useful (maybe limited) medical applications. For instance, radiology. I’ve read that reading CT scans is part art, part science. That no radiologist is right 100% of the time and the error rate is about 25%. Again, only from my reading. No practical experience here.

    So, wouldn’t it be feasible to feed AI billions of, say, mammograms, have it analyze them, and if necessary compare its results with actual clinical diagnoses? Of course N would be huge, but isn’t that one of the supposed strengths of AI? Something it was designed for? Power in numbers? Something beyond human capabilities?

    Anyway, just a thought and keeping in mind that the coin of technology always has two sides.

  10. flora

    When AI comes up with totally wrong answers, how do doctors provide feedback to the AI telling it its answer is wrong? Does medical AI have a VAERS-like feedback mechanism?

  11. Tom Stone

    I have both idiosyncratic drug reactions and multiple health issues at age 72. Using AI to prescribe drugs for me would likely either kill me or reintroduce me to the lovely nurses in the ICU.

  12. WillD

    Everyone expects, and wants, AI to be totally logical, rational and unbiased in all things. The reality, however, is that it is not, because of a) the vast amounts of ‘bad’ data it is fed along with the good stuff, and b) the programming that allows for illogical and contradictory ‘reasoning’.

    This makes it just as bad, and sometimes worse, than human reasoning, which in turn means the only advantage it has over humans is that it can trawl massive amounts of data it has ‘learned’ in order to provide its questioner with answers. And it can do this very quickly.

