Is AI a Reliable Source of Medical and Health Advice?

Answer: No, although it will get better

Two years ago I wrote a series of articles on digital transformation in healthcare, covering AI, wearable technology and digital health systems. The intervening two years have seen an extraordinary acceleration in both the capacity and reach of AI. Large language models are changing all areas of human activity, including medicine. They bring incredible opportunities but also challenges and threats.

To appreciate the current challenges of AI in healthcare, it is important to address the misconceptions around how these systems work. They are designed to construct plausible language rather than to filter for truth or fact. Current AI engines have no innate intelligence and no internal mechanism for causal reasoning. They make algorithmic predictions of the next word based on their training data set. This focus on fluent language can cause them to hallucinate (a technical term for producing a confident, plausible-sounding answer that is wrong or made up). They are inclined to please, so they would rather make stuff up and be confidently wrong than admit ignorance. Surprisingly, current models are not that great with numbers and frequently mix up scientific references. So what does this mean as increasing numbers of people turn online for health and wellness advice?
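For readers who like to see the idea in code, the following is a minimal sketch of next-word prediction. It is a toy word-pair counter, not how any real AI engine is built (real models use neural networks trained on enormous data sets). The function names and the training text are hypothetical, invented purely for illustration.

```python
from collections import Counter, defaultdict


def train_bigram_model(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    followers = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        followers[current_word][next_word] += 1
    return followers


def predict_next_word(model, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    candidates = model.get(word.lower())
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]


# Hypothetical training text, invented for this illustration only.
training_text = (
    "rest helps a sore knee . "
    "ice helps a sore knee . "
    "ice helps a sore throat ."
)

model = train_bigram_model(training_text)
print(predict_next_word(model, "sore"))   # knee (seen twice, versus throat once)
print(predict_next_word(model, "helps"))  # a
print(predict_next_word(model, "fever"))  # None (never seen in training)
```

The toy model answers "knee" because that word followed "sore" most often in its training text, not because it has checked whether the statement is true. It would repeat "ice helps a sore throat" just as readily if that phrase were more common.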

The Current Landscape

A recent study, published in BMJ Open in 2026, examined the accuracy of large language models in responding to medical questions. It evaluated the performance of five AI platforms across various medical domains, using standard questions relating to cancer, vaccines, stem cells, nutrition and human performance, and found significant variability in accuracy rates. Overall, almost 50% (49.6%) of responses were described as problematic: 30% somewhat problematic and 19.6% highly problematic. That is to say, almost one in five responses were determined to be potentially harmful.

The study highlighted that AI chatbots performed better on general medical knowledge questions but struggled with complex clinical reasoning, nuanced patient scenarios, and situations requiring contextual understanding of individual patient circumstances. It is somewhat reassuring that questions around cancer and (perhaps surprisingly) vaccines received more accurate answers, presumably because of the more established evidence base in those areas. Questions around nutrition and athletic performance fared worst.

Diagnostic Accuracy and Clinical Reasoning

One of the most significant findings from recent research is that while AI systems can demonstrate impressive breadth of medical knowledge, their diagnostic accuracy remains inconsistent. A 2024 study published in Learning Health System found that ChatGPT-4's diagnostic accuracy varied significantly by medical specialty, with particular weaknesses in dermatology, psychiatry, and conditions requiring visual assessment or subjective interpretation of symptoms.

The fundamental limitation returns to how these systems were built and designed to function. Large language models identify patterns in text data. Current AI engines do not yet understand disease mechanisms, physiological processes, or the complex interplay of symptoms that experienced clinicians recognize. They cannot yet perform physical examinations, order appropriate investigations based on clinical judgment, or integrate the subtle non-verbal cues that often guide diagnostic thinking.

Moreover, the BMJ Open study noted concerning patterns of overconfidence in AI responses. The systems would provide definitive-sounding answers even when the evidence base was uncertain or when multiple diagnostic possibilities should have been considered. This mirrors a broader problem in AI development: these systems are optimized for generating confident, fluent text rather than expressing appropriate clinical uncertainty.

The Question of Training Data

The reliability of AI medical advice is fundamentally constrained by its training data. Large language models are trained on vast amounts of internet text, academic papers, and other sources. However, this creates several problems specific to medical information.

First, not all medical information online is accurate or current. AI systems may incorporate outdated treatment protocols, debunked theories, or information from unreliable sources into their knowledge base. Second, the systems cannot distinguish between high-quality evidence from randomized controlled trials and anecdotal reports or opinion pieces. Third, medical knowledge evolves rapidly, but AI models have knowledge cutoff dates and may not incorporate recent clinical guidelines or emerging research.

The BMJ Open study specifically identified instances where AI chatbots provided treatment recommendations that contradicted current clinical guidelines or cited non-existent studies, a phenomenon researchers termed "reference hallucination." This is particularly dangerous because it creates a veneer of scientific legitimacy while potentially misleading users.

Individual Variation and Context

Medicine is inherently personal. Treatment decisions depend on individual patient factors including comorbidities, medication interactions, allergies, previous treatment responses, patient preferences, and social circumstances. AI systems struggle profoundly with this individualization.

The 2026 BMJ Open study emphasized that AI chatbots frequently provided generic advice that failed to account for patient-specific factors. A medication recommendation that might be appropriate for a young, healthy adult could be dangerous for an elderly patient with kidney disease. A dietary intervention suitable for one person might be contraindicated for another with metabolic conditions.

Furthermore, AI cannot assess the severity or urgency of symptoms. What a patient describes as "chest pain" could range from muscular strain to life-threatening cardiac emergency. Human clinical judgment, informed by systematic training and pattern recognition from seeing thousands of patients, remains essential for these critical assessments.

Regulatory and Liability Concerns

The current regulatory landscape has not kept pace with AI deployment in healthcare. Most AI chatbots used by the public for health advice operate without regulatory oversight as medical devices. They carry disclaimers stating they do not provide medical advice, yet are widely used for exactly that purpose. This creates an obvious gap in accountability and, potentially, liability.

Conclusion

The evidence from the 2026 BMJ Open study and related research points to clear conclusions about AI as a source of general medical advice: current large language models are not yet reliable enough. Their accuracy is too variable, their tendency toward hallucination too high, and their inability to individualize care too fundamental.

Is this just the fear of a professional dinosaur hoping to stop the tide coming in? Absolutely not. AI has already revolutionised medicine: it has demonstrated superiority to humans in a number of repetitive tasks, including radiology and histopathology analysis. I use AI systems every day, including transcription for my consultation notes. This has hugely improved my practice by freeing up the bandwidth previously spent recording a consultation, so I can now put more energy into active listening. I also use AI engines for diagnostic support. Critically, I have access to engines trained specifically on medical data sets, as well as a fundamental understanding of how to write effective prompts that put specific questions in context. The same pattern holds across multiple domains of expertise: current AI systems seem to be most effective when searching within a domain in which an individual already holds some degree of expertise.

In my opinion, next-generation AI will likely be trained using experiential learning models rather than language models. For certain, AI is not going away; it will evolve and get better. At the moment the available engines are problematic, especially in areas of nuance, complexity and uncertainty. The best we can do as individuals is to remain informed and understand the potential pitfalls in searching for health information online.

Dr David Owens

Family Medicine, General Practice, Sports Medicine

MB ChB (Leeds)
PGDipSEM (Bath)
MRCGP (UK)
FHKAM (Family Medicine)
Honorary Clinical Assistant Professor in Family Medicine (HKU)

References

Tiller, N.B., Marcon, A.R., Zenone, M. et al. (2026). 'Generative artificial intelligence-driven chatbots and medical misinformation: an accuracy, referencing and readability audit.' BMJ Open, 16, p.e112695. Available at: https://bmjopen.bmj.com/content/16/4/e112695
Thirunavukarasu, A.J., Ting, D.S.J., Elangovan, K. et al. (2023). 'Large language models in medicine.' Nature Medicine, 29(8), pp.1930–1940. Available at: https://doi.org/10.1038/s41591-023-02448-8
Lee, P., Bubeck, S. and Petro, J. (2023). 'Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine.' New England Journal of Medicine, 388(13), pp.1233–1239. Available at: https://doi.org/10.1056/NEJMsr2214184
Levkovich, I. (2025). 'Evaluating diagnostic accuracy and treatment efficacy in mental health: a comparative analysis of large language model tools and mental health professionals.' European Journal of Investigation in Health, Psychology and Education, 15(1), p.9. Available at: https://doi.org/10.3390/ejihpe15010009
Mackenzie, E.M., Sanabria, B., Tchack, M., Khan, S. and Rao, B. (2024). 'Investigating the diagnostic accuracy of GPT-4's novel image analytics feature in dermatology.' Journal of the European Academy of Dermatology and Venereology, 38, pp.e954–e956. Available at: https://doi.org/10.1111/jdv.20006