Harsh Atul Hirani
Diabetologist, India
Presentation Title:
Clinical assessment of large language models: a comprehensive multi-domain performance study for healthcare applications
Abstract
Background: The rapid integration of large language models (LLMs) into clinical workflows demands rigorous safety evaluation beyond theoretical benchmarks. This study presents the first comprehensive, multi-domain clinical assessment of four leading commercial LLMs evaluated across 15+ clinical domains using a structured 4-level safety classification framework — Low, Moderate, High, and Critical — designed specifically for healthcare deployment readiness.
Methods: Four commercially available LLMs were evaluated: ChatGPT-4 (OpenAI), Google Gemini Pro, Perplexity Pro, and Grok-2 (xAI). Each model was assessed across 15+ clinical domains in diabetology using standardised scenarios. Performance was scored out of 100 and rated across five safety risk categories: medical product hallucination, dosage calculation accuracy, source citation reliability, clinical scenario memory, and clinical translation accuracy.
Results: Composite scores were ChatGPT-4 (82/100), Perplexity Pro (79/100), Grok-2 (73/100), and Gemini Pro (67/100). No single model achieved excellence across all domains. Three critical universal failures were identified: (1) all four models failed clinical scenario memory continuity [High risk]; (2) a 23–31% medical product hallucination rate was observed, with Gemini Pro receiving a Critical classification after fabricating a non-existent cardiac monitor; (3) three of four models failed consistent source attribution [High risk], with only Perplexity Pro achieving 94% citation accuracy. Response times ranged from 12.3 seconds (Grok-2) to 31.2 seconds (Gemini Pro).
Conclusion: LLMs demonstrate substantial potential for enhancing clinical workflows and administrative efficiency. However, no current model is safe for autonomous clinical decision-making. Targeted implementation in low-risk administrative and research applications with robust human oversight represents the most prudent pathway. A phased evidence-based rollout strategy — administrative (Phase 1), research and education (Phase 2), and clinical decision support (Phase 3) — is proposed. Patient safety demands empirical verification, not assumption of model improvement.
Biography
Harsh Atul Hirani is a clinician and researcher based in Hyderabad, India, with expertise in diabetes technology, clinical AI evaluation, and healthcare innovation. He is the lead author of this multi-centre study and serves as a key collaborator within the MediMinds and DocYantra ecosystems. He is actively engaged in advancing evidence-based AI governance in Indian and global healthcare settings, with particular interests in AI safety frameworks, LLM clinical utility, and technology-driven patient education. His work spans peer-reviewed publication, conference presentation, and curriculum development for healthcare professionals in the emerging field of clinical AI.