Why GenAI can’t do real-world medicine (yet)
- Rosa Bowden
- Oct 7
Updated: Oct 8
Healthily is often challenged to defend its Dotᵀᴹ AI Health Navigation Platform against sceptics who think GenAI can already safely and accurately perform health navigation and medical-grade assessments.
Our answer is that GenAI can’t do these tasks, but fortunately, our clients no longer have to take just our word for it. Microsoft Research’s Health and Life Sciences department has published a comprehensive study on the subject, called “The Illusion of Readiness”.
The study shows emphatically that big-name AI models like ChatGPT (whose maker, OpenAI, Microsoft has heavily invested in), Gemini and DeepSeek can ace medical tests but wobble in the real world.
The research group’s testing showed beyond doubt that LLMs don’t understand medicine or patients – they are just really good at guessing when given a standard test.
When presented with a series of stress tests, the models began to fail.
Why does this matter?
Safety is paramount in healthcare. Healthily has to be audited by independent specialists to maintain its certification as a Class IIa Medical Device. We have to demonstrate that all information our AI Medical Assessment provides is transparent and logically explained based on science and understanding, not guesswork.
Healthcare is also bespoke to the individual and the healthcare system that cares for them. When it comes to navigating health systems, the same patient with the same symptoms and conditions may need different services depending on location or other personal factors.
Yes, we use AI to do all this and are increasingly introducing LLMs, but never without human intervention and tight management.
So what exactly happens when you stress-test LLMs?
Shortcut Learning - Models Succeed for the Wrong Reasons
Alarmingly, LLMs can give the right answer for the wrong reasons.
When asked to identify the cause of a skin complaint without being shown a picture of it, LLMs still got the answer right, guessing from patterns in similar images and questions seen during training.
On multimodal benchmark tests drawn from the New England Journal of Medicine and the Journal of the American Medical Association, GPT-5 and other models achieved high accuracy even when the images were removed:
GPT-5: 80.9% (with image), 67.6% (without image), a drop of 13.3 percentage points.
Despite image removal, scores remained well above the 20% random baseline, showing models “guess correctly even when necessary inputs like images are removed.” On 175 image-required diagnostic questions, GPT-5 scored 37.7% without any image (a figure that should be close to 20% if its reasoning were genuine).
If the models still perform well above chance without images, this suggests a reliance on dataset artifacts – frequency priors, co-occurrence patterns, or memorised question-answer pairs – rather than genuine medical understanding.
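To see why “well above chance” matters, here is a minimal sketch (ours, not the study’s) that asks how plausible it is to score 37.7% on 175 five-option questions by pure guessing. The question count, score and 20% baseline are the figures quoted above; the binomial test is simply our own back-of-the-envelope illustration.

```python
# Illustrative only: could GPT-5's 37.7% on the 175 image-required
# questions be explained by random guessing on 5-option multiple choice?
# The figures are taken from the study as quoted above; the binomial
# test is our own illustration, not the paper's methodology.
from scipy.stats import binomtest

n_questions = 175          # image-required diagnostic questions
observed_accuracy = 0.377  # GPT-5 score with the image removed
chance = 0.20              # random baseline for five answer options

k_correct = round(observed_accuracy * n_questions)  # ≈ 66 correct answers
result = binomtest(k_correct, n_questions, chance, alternative="greater")

print(f"Correct without the image: {k_correct}/{n_questions}")
print(f"Probability of doing this well by pure guessing: {result.pvalue:.1e}")
# A vanishingly small p-value means the model is not guessing at random:
# it is exploiting something other than the (absent) image.
```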
By contrast, if essential inputs aren’t provided to the Healthily Dotᵀᴹ AI Health Navigation Platform, users are always directed to the safest next step based only on the information actually supplied.
Superficial Reasoning — Performance Collapses Under Minor Changes
Next, if you give the models medical questions they are familiar with – like standardised medical tests – but make tiny tweaks to words or the order of options, their accuracy declines despite no change to the core medical question.
When answer options were reordered, accuracy dropped:
GPT-5: 37.7% → 32.0% (–5.7 pp) in the text-only condition
Other models showed similar falls, revealing a dependence on format, not content.
When the other “incorrect” multiple-choice answers were replaced (effectively changing the recognised pattern), GPT-5’s accuracy dropped from 37.7% to 20.0% – back to chance level. When the image was swapped for one corresponding to a previously “incorrect” answer, accuracy fell by more than 30 percentage points.
These sharp declines suggest that many models rely on learned visual–answer pairings rather than interpreting visual evidence in context.
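To make this kind of perturbation concrete, here is a minimal sketch (ours, not the study’s code) of an option-shuffling stress test. The clinical question, the answer options and the position-biased “mock model” are all invented for illustration; in practice you would swap the mock for a real model call.

```python
# Illustrative option-shuffling stress test. The question, options and
# the "mock model" below are invented; the mock stands in for an LLM
# that has learned the test format rather than the medicine.
import random

def mock_model(question: str, options: list[str]) -> int:
    """Always favours whichever option appears first (a format bias)."""
    return 0

question = ("A patient presents with a violaceous rash on the upper "
            "eyelids. What is the most likely diagnosis?")
options = ["Dermatomyositis", "Contact dermatitis", "Rosacea",
           "Psoriasis", "Lupus"]
answer = "Dermatomyositis"

# Ask the same clinical question under several option orderings.
# A system that reasons about the case should pick the same diagnosis
# every time; a format-dependent one will not.
for seed in range(5):
    rng = random.Random(seed)
    shuffled = options[:]
    rng.shuffle(shuffled)
    picked = shuffled[mock_model(question, shuffled)]
    verdict = "correct" if picked == answer else "wrong"
    print(f"ordering {seed}: model chose {picked!r} ({verdict})")
```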
Why is this important? Real patients aren’t benchmark test questions. They describe symptoms in any order and mix in details that turn out to be irrelevant.
In contrast, the Healthily Dotᵀᴹ AI Health Navigation Platform responds to what the user actually inputs, not to what the model expects them to have entered.
Fabricated Reasoning — Plausible but False Medical Explanations
This was potentially the most concerning finding. Manual audits by the researchers found: “Models trained to ‘think step by step’ often paired confident rationales with incorrect logic, producing medically sound explanations for wrong answers, or correct answers supported by fabricated reasoning.”
Examples include GPT-5 perfectly describing a “heliotrope rash” from a blank image, and models offering plausible but incorrect explanations that led to misdiagnoses and fabricated visual details.
The researchers concluded: “These aren’t minor technical glitches. They reveal fundamental problems with how we evaluate and incentivize progress in health AI. Current benchmark tests test pattern matching, not medical understanding. They rewarded consistency on test formats rather than robustness under real medical conditions.
“Real-world medical decisions are made under uncertainty, incomplete information, and high stakes. If a model fails when answer choices are shuffled, how can we trust it with ambiguous symptoms or noisy imaging?”
At Healthily, we couldn’t agree more. That’s why we will always put safety and quality above headlines.
Our doctors and data scientists have built a system using Bayesian inference that mimics how a doctor assesses risk and reasons their way to the safest next step for the individual.
LLMs are great for language (explaining and gathering history), but for clinical decisions, Bayesian inference gives trustworthy, calibrated, and auditable outputs.
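For readers who want to see the idea rather than take it on trust, here is a heavily simplified, naive-Bayes-style sketch of how a probability gets updated as findings arrive. The condition, prevalence and likelihood figures are invented for illustration and are not Healthily’s clinical model or data.

```python
# Simplified sketch of Bayesian updating for a single condition.
# All numbers below are invented for illustration only.

def update(prior: float, p_if_condition: float, p_if_not: float) -> float:
    """Bayes' rule: posterior probability of the condition given one finding."""
    numerator = p_if_condition * prior
    return numerator / (numerator + p_if_not * (1.0 - prior))

# Start from a made-up background prevalence of 2%.
posterior = 0.02

# Each reported finding revises the estimate, much as a clinician
# revises their suspicion as the history unfolds.
findings = [
    # (finding, P(finding | condition), P(finding | no condition))
    ("fever",            0.80, 0.10),
    ("productive cough", 0.70, 0.15),
    ("chest pain",       0.40, 0.08),
]
for name, p_if_cond, p_if_not in findings:
    posterior = update(posterior, p_if_cond, p_if_not)
    print(f"after {name:<16}: P(condition) = {posterior:.2f}")

# The result is a calibrated probability with an audit trail, which is
# what safe triage thresholds and explanations can be built on.
```

(The sketch assumes the findings are conditionally independent, which a production model would not; it is here only to show why the output is calibrated and explainable rather than guessed.)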
Best practice requires an LLM front-end and a Bayesian brain. Most AI passes exams. We pass real-world stress tests.
The Healthily Dotᵀᴹ AI Health Navigation Platform doesn’t guess without evidence, stays steady when wording or context changes, and explains decisions clearly. That’s why customers can deploy us now to deliver right care, first time – safely and at scale.



