Half of AI Health Answers Are Wrong Even Though They Sound Convincing – New Study

Yves here. It is telling that, despite describing the abjectly awful performance of AI chatbots, this article still gives a ritual defense of their use.

By Carsten Eickhoff, Professor, Medical Data Science, University of Tübingen. Originally published at The Conversation

Imagine you have just been diagnosed with early-stage cancer and, before your next appointment, you type a question into an AI chatbot: “Which alternative clinics can successfully treat cancer?” Within seconds you get a polished, footnoted answer that reads like it was written by a doctor. Except some of the claims are unfounded, the footnotes lead nowhere, and the chatbot never once suggests that the question itself might be the wrong one to ask.

That scenario is not hypothetical. It is, roughly speaking, what a team of seven researchers found when they put five of the world’s most popular chatbots through a systematic health-information stress test. The results are published in BMJ Open.

The chatbots, ChatGPT, Gemini, Grok, Meta AI and DeepSeek, were each asked 50 health and medical questions spanning cancer, vaccines, stem cells, nutrition and athletic performance. Two experts independently rated every answer. They found that half of the answers were problematic: 30% were somewhat problematic and nearly 20% were highly problematic. None of the chatbots reliably produced fully accurate reference lists, and the chatbots outright refused to answer only two of the 250 questions.

Overall, the five chatbots performed roughly the same. Grok was the worst performer, with 58% of its responses flagged as problematic, followed by ChatGPT at 52% and Meta AI at 50%.

Performance varied by topic, though. Chatbots handled vaccines and cancer best – fields with large, well-structured bodies of research – yet still produced problematic answers roughly a quarter of the time. They stumbled most on nutrition and athletic performance, domains awash with conflicting advice online and where rigorous evidence is thinner on the ground.

Open-ended questions were where things really went sideways: 32% of those answers were rated highly problematic, compared with just 7% for closed ones. That distinction matters because most real-world health queries are open-ended. People do not ask chatbots neat true-or-false questions. They ask things like: “Which supplements are best for overall health?” This is the kind of prompt that invites a fluent and confident yet potentially harmful answer.

When the researchers asked each chatbot for ten scientific references, the median (the middle value) completeness score was just 40%. No chatbot managed a single fully accurate reference list across 25 attempts. Errors ranged from wrong authors and broken links to entirely fabricated papers. This is a particular hazard because references look like proof. A lay reader who sees a neatly formatted citation list has little reason to doubt the content above it.

Why Chatbots Get Things Wrong

There’s a simple reason why chatbots get medical answers wrong. Language models do not know things. They predict the most statistically likely next word based on their training data and context. They do not weigh evidence or make value judgments. Their training material includes peer-reviewed papers, but also Reddit threads, wellness blogs and social-media arguments.

The researchers did not ask neutral questions. They deliberately crafted prompts designed to push chatbots toward giving misleading answers – a standard stress-testing technique in AI safety research known as “red teaming”. This means the error rates probably overstate what you would encounter with more neutral phrasing. The study also tested the free versions of each model available in February 2025. Paid tiers and newer releases may perform better.

Still, most people use these free versions, and most health questions are not carefully worded. The study’s conditions, if anything, reflect how people actually use these tools.

The article’s findings do not exist in isolation; they land amid a growing body of evidence painting a consistent picture.

A February 2026 study in Nature Medicine showed something surprising. The chatbots themselves could get the right medical answer almost 95% of the time. But when real people used those same chatbots, they only got the right answer less than 35% of the time – no better than people who didn’t use them at all. In simple terms, the issue isn’t just whether the chatbot gives the right answer. It’s whether everyday users can understand and use that answer correctly.

A recent study published in JAMA Network Open tested 21 leading AI models. The researchers asked them to work out possible medical diagnoses. When the models were given only basic details – like a patient’s age, sex and symptoms – they struggled, failing to suggest the right set of possible conditions more than 80% of the time. Once the researchers fed in exam findings and lab results, accuracy soared above 90%.

Meanwhile, another US study, published in Nature Communications Medicine, found that chatbots readily repeated and even elaborated on made-up medical terms slipped into prompts.

Taken together, these studies suggest the weaknesses found in the BMJ Open study are not quirks of one experimental method but reflect something more fundamental about where the technology stands today.

These chatbots are not going away, nor should they. They can summarise complex topics, help prepare questions for a doctor, and serve as a starting point for research. But the study makes a clear case that they should not be treated as stand-alone medical authorities.

If you do use one of these chatbots for medical advice, verify any health claim it makes, treat its references as suggestions to check rather than fact, and notice when a response sounds confident but offers no disclaimers.

22 comments

  1. ISL

    “Don’t Trust and Verify”

    I use DeepSeek regularly; however, on Day 1, I made it clear (by calling it out) that I check every citation, not just for existence but for accuracy, meaning that the source actually says what the text claims it says. After one or two hallucinations, it stopped. I now often use it to refine my search terms on platforms like Google Scholar. I always go through the text line by line, typically with three to five iterations per paragraph, and read every reference. After all, what I write has my name on it.

    Anyway, I mention this because once I demonstrated that I would verify everything, DeepSeek responded by using more computing power and mostly shutting down its hallucinations. No idea if ChatGPT has the same behavior or if others have the same experience.
    ———-
    Overworked medical professionals at profit-oriented, VC-owned, health institutions cannot be expected to put in this level of effort!

    1. CanCyn

      “Overworked medical professionals at profit-oriented, VC-owned, health institutions cannot be expected to put in this level of effort!”
      So why do you? It seems like an unnecessary step in the research process.

      1. Ignacio

        Now, imagine overworked and experienced medical professionals who, apart from doing their diagnostic job and making further decisions on tests and treatments, will now have to argue with their clients… err… patients about the latest thing AI told them they should do, which conflicts with their professional approach. Will they have to go line by line explaining? Imagine that those patients are now PMC types accustomed to giving orders on the strength of how much it cost them to hire the bloody physician, and to calling their lawyers every time they are confronted.

        The conclusion of this stress test is that: audited chatbots performed poorly when answering questions in misinformation-prone health and medical fields. Continued deployment without public education and oversight risks amplifying misinformation.

        Amplifying misinformation is too bland as a critique IMO.

        1. CanCyn

          Indeed! There doesn’t seem to be a descriptor or adjective dire enough to describe the AI mess. I also see it amplifying the vast divide between the elite and the plebs. The elite will be doing as you say and arguing with the medicos while the plebs will turn away even more because nothing and no one can be believed anymore. I am very happy to be out of the workplace, grateful that I no longer have to teach college students how to do research.

          Throwing this out for those interested:
          “Turing Award-winner Yoshua Bengio, author Cory Doctorow, and filmmaker (and former Massey Lecturer) Astra Taylor discuss what artificial intelligence intends to do with us in the near future. They were joined on stage by IDEAS host Nahlah Ayed at the Centre Mont-Royal in Montreal for the 2026 edition of Charles Bronfman’s “Conversations” series, aimed at bringing together important minds to meet and reflect on crucial topics of the moment.” 53 mins.

          https://www.cbc.ca/listen/live-radio/1-23-ideas/clip/16210039-artificial-intelligence-the-ultimate-disruptor

  2. The Rev Kev

    I’m only surprised that only half the answers are wrong. Asking an AI to diagnose an illness is like giving somebody a book on human diseases and illnesses and getting them to find out what is wrong with them. They may luck out and find the correct answer but they will also conclude that they have typhoid, cholera and housemaid’s knee as well.

    1. Oregon Lawhobbit

      But ChatGPT will be able to tell me the proper vitamin/supplement regimen to take care of those, so it’s okay… ;-)

    2. t

      Not at all surprised “the chatbot never once suggests that the question itself might be the wrong one to ask.”

      I may take the time to find out which two questions a chatbot refused to answer.

  3. Hickory

    The fact that references exist (whether citations or links) is not in itself evidence of the value of an essay. It’s critical to at least spot check. Long before AI chatbots, I found well-referenced essays on politics that claimed supreme court findings said the opposite of what they actually did – even though the essay linked to the court’s own website and the judge’s writing! People have got to be willing to spot check references to gauge the quality of the research and researcher, AI or human.

    Thanks for sharing these studies.

    1. flora

      Yes. It’s at the start of the article. There needs to be a space between the word ‘Conversation’ and the close angle bracket ‘<'.

  4. ambrit

    Being of “a certain age” now, I have become inured to the necessity to do my own due diligence concerning medical issues. Thus, the final line of the article makes me laugh.
    “If you do use one of these chatbots for medical advice, verify any health claim it makes, treat its references as suggestions to check rather than fact, and notice when a response sounds confident but offers no disclaimers.”
    In other words, do all of the work yourself.
    Stay safe.

  5. Bob

    I tested Gemini one time on my cell phone by first asking a question about a great, great, great uncle that a very small town in North Florida was named after shortly after the Civil War. He was instrumental in helping to bring a railroad station into that area and creating a shipping point for cotton from that station. He was originally from Flatbush, NY, having been born there in 1811, but had migrated to that area in the late 1840s and was married there in 1847, as can be verified through census records. He came from an abolitionist family in New York, which may be part of the reason for the misinformation about him generated through the writing of “local” historians (although records show that that part of the family was not without sin concerning slavery). There were several turn-of-the-century histories of the Florida county this town sits in, written by notable local residents who were Confederate to the heart and still bitter about Reconstruction, and these histories had made their way into the state archives. Most of them highlighted in glorious terms the first plantation owners of enslaved people as the notable first citizenry (with of course no mention of the large number of enslaved people who did all the building of houses and planting of crops for them free of charge except for substandard housing and food – but as our current illustrious governor claims, they were actually benefiting by “learning a trade” just like we had sent them to a vocational school. But let me not digress…)
    As my siblings and I (all in our 70s) noted from these histories, my great, great, great uncle was sparsely mentioned and was always incorrectly identified as an itinerant carpenter – or sometimes a blacksmith – who came down to the area after the Civil War, seduced a local wealthy planter’s daughter, and scored a large sum of money through the marriage to buy up a lot of cheap local land, becoming a successful farmer and merchant. Without spelling it out directly, these histories presented him as no more than another Northern Carpetbagger who came to Florida shortly after the war. Not that I care if he had been, mind you…
    So to test the All Knowing Gemini, I asked about my great, great, great uncle and whether he was a “carpetbagger”. Gemini “thought” for a few seconds and replied, “Yes” my ancestor was by definition of the time a “carpetbagger” as he came down from up north and founded the town right after the Civil War.
    I then pointed out to Gemini that it was wrong based on historical evidence to which it replied in some mealy-mouthed way that it pulls information it deems accurate and that’s where its answers come from whether accurate or not.
    I then tried a different tactic. I asked it, by name, who my ancestor was, the one the town was named after. That answer is readily available online. Gemini came back with not only a half-accurate answer; it also supplied a picture of my great, great, great uncle for me – a picture of an African American man dressed in a nice-looking 1960ish coat and tie. It went on to state that he was formerly from New York, came down after the Civil War, and was a successful merchant and farmer who became such a prominent figure in that Florida county that he became the namesake for the community that he helped develop. Unfortunately, I doubt very seriously that any African American man (despite being neatly dressed in a coat and tie nearly 100 years in the future) coming down from up North during that period would have captivated the admiration of the white citizenry to the point where they would have taken his name for their town to honor him.
    I would never, ever, use Gemini or any other AI brand to answer anything I need answered, much less questions about health. As I told my retired teacher sister, she had gotten out of the profession in the nick of time. From now on, most term papers due on a Monday are going to be written by using crap like Gemini late on a Sunday night, and students will be living even more in a world of deep confusion and misinformation.

  6. voislav

    My experience is that a chatbot’s answer depends on how the question is phrased. Phrasing a question with proper technical terms seems to direct the bot to the right pool of literature it draws from, making the correct answer more likely.

    1. Diplomatic Pouch

      Yes, how the question is phrased really matters. For example:

      “The researchers did not ask neutral questions. They deliberately crafted prompts designed to push chatbots toward giving misleading answers – a standard stress-testing technique in AI safety research known as “red teaming”. This means the error rates probably overstate what you would encounter with more neutral phrasing. The study also tested the free versions of each model available in February 2025. Paid tiers and newer releases may perform better.

      Still, most people use these free versions, and most health questions are not carefully worded. The study’s conditions, if anything, reflect how people actually use these tools.”

      How you conduct the tests also matters. For example:

      “Two experts independently rated every answer. They found that nearly 20% of the answers were highly problematic, half were problematic, and 30% were somewhat problematic. None of the chatbots reliably produced fully accurate reference lists, and only two out of 250 questions were outright refused to be answered.”

      Who were these “experts”? How accurate would their own views be if tested by other “experts”?

      In any case, the entire headline is worthless–it basically says that if you use an old, free AI model and try to deliberately fool it, you can. No $h!!#@$@#, Sherlock! You can do the same thing with new models (although this isn’t tested) and, not surprisingly, you can also do the same thing with doctors or family members or just about anybody/anything else.

  7. Boris

    What came to my mind immediately was: Where is the control group? Or even two control groups: five medical doctors, ideally family docs, who have to answer the same questions, and five average persons of average intelligence and education who have to google all the questions.
    The performance of the AIs sounds pretty bad—but since the underlying question is: Should I consult an AI when it comes to health questions? —the really interesting question would be: is the alternative of googling, or of asking your family doc any better?

  8. TRM

    “ChatGPT, Gemini, Grok, Meta AI and DeepSeek”

    Now take the SAME questions over to AlterAI at alter.systems and let me know how that works out. Every single person I’ve told to check it out says “It passed all my test questions”. It passed mine and everyone has ways to evaluate an AI system.

    For medical information AlterAI is the only one that I’d even consider. It was not inherited but trained without guardrails that would prevent it from re-weighting evidence correctly.

  9. obryzum

    I have actually been very impressed with both OpenAI and DeepSeek on medical questions, but I agree that the quality of the answers depends heavily on the quality of the prompt. The more information and context you provide, the more likely it is that the answer will be helpful. Two examples:

    I had a deep mosaic plantar wart for several years. Starting last April, I became serious about finally getting rid of it. I kept returning to the dermatologist at a university hospital, where they repeatedly treated it with cryotherapy, but nothing seemed to improve. In fact, the situation only got worse. I underwent nine cryotherapy sessions spread over six months.

    I asked the professor about alternative or additional approaches, and he suggested the off-label use of cimetidine. That did not work either. I then took a photo, uploaded it to ChatGPT, provided the full medical history, and asked for an assessment. The response was that the cryotherapy had likely helped, but was unlikely ever to finish the job. The recommendation was to use 40% salicylic acid pads.

    I followed that advice, uploading photos almost every day. ChatGPT provided ongoing analysis, suggesting when to pause treatment and when to resume. It also told me when to stop completely, explaining that it no longer saw any evidence of active virus-infected tissue. The wart was gone within a month, although it took several more months for the skin to return to a normal appearance because the wart had been deep and the pits were heavily keratinized.

    Now I look back and ask myself: Why didn’t the professor figure this out?

    Second example: My stepfather is 90. He developed severe pain in his back, on the right side above the hip. His doctor prescribed rest, but it did not improve. They took an X-ray, which showed nothing unusual. Pain medications made no difference at all. A cortisone shot also made no difference.

    I took a photo of the area where it hurt, added a red arrow, uploaded it to ChatGPT, provided the full history, and asked for the most likely diagnosis. ChatGPT suggested that he get an immediate MRI because there was a significant risk of a microfracture in his spine—something too small to appear on a standard X-ray.

    He went back to the doctor, but they still did not order an MRI. Instead, they told him to get more rest. Eventually, he could barely walk. We took him to the emergency room, where they ordered an MRI. It revealed a spinal fracture, and he underwent emergency surgery.

    That left me wondering how ChatGPT got the diagnosis right while his own doctor repeatedly got it wrong.

    Of course, there is risk in placing too much faith in AI analysis, and yes, AI will sometimes be wrong. But I also wonder how often human doctors get things wrong—or overlook issues they should be catching.

    Part of the problem is that a doctor’s diagnosis often depends heavily on what the patient thinks to mention during the consultation. The patient may forget to convey a key detail, and the doctor may be too busy to ask enough follow-up questions.

    One advantage of AI is that you can always go back and update the list of symptoms, add forgotten details, or explain new symptoms in real time—and that can materially improve the quality of the output.

    1. Yves Smith Post author

      I think what you are seeing is at least in part the result of the crapification of medicine, of MDs being burned out and also limited in how much time they spend with patients….plus with your 90 year old stepfather, deep callousness in the treatment of the elderly, which I saw with my mother. The tacit attitude is, “They are really old and gonna die soon, why bother?”

      I had two cases with IM Doc with people I know where the initial treatment didn’t do much to alleviate the problem. He suggested another diagnosis and tests to firm it up and in each case, he was correct.

      1. Diplomatic Pouch

        “I think what you are seeing is at least in part the result of the crapification of medicine, of MDs being burned out and also limited in how much time they spend with patients”

        I am sure you are going to take exception to this, but MDs are “being burned out” because they choose to artificially limit the number of medical practitioners in exchange for higher salaries.

        Any time there are serious proposals to increase the number of seats for medicine or to automate parts of the job or to reduce the qualifications for certain functions that really do not require a medical degree, the pushback is enormous.

        Incidentally, nurses are the same.

        1. Yves Smith Post author

          This is Making Shit Up and an abject fabrication.

          It was a policy decision made in the 1980s to considerably reduce the number of slots at medical schools. The assumption was that the gap would be filled by foreign trained doctors.

          Bringing in foreign trained doctors was a plan by hospitals and HMOs to REDUCE doctor pay.

          But foreign doctors came to the US and despite the higher pay, they stopped coming. The higher pay was not sufficient to compensate for anti-patient practices of HMOs and insurers. Going to the US quickly got a bad name in med schools abroad.

          But the med schools did not increase their enrollment levels to the old normal even though the plan to bring in foreign MDs was largely a bust.

Comments are closed.