LLMs Fall Short on Clinical Reasoning: New Benchmark Reveals Critical Gaps in Differential Diagnosis
A comprehensive evaluation of 21 state-of-the-art large language models reveals significant limitations in clinical reasoning, particularly in differential diagnosis, prompting researchers to recommend supervised, targeted deployment only.
Background
Large language models have increasingly captured attention in healthcare, yet their capacity for clinical reasoning—the cognitive process physicians use to diagnose and manage patients—remains poorly understood. Researchers evaluated 21 leading models (including GPT-5, Claude 4.5 Opus, Gemini 3.0, and Grok 4) using a new framework called PrIME-LLM, designed to assess clinical reasoning performance systematically.
Key Findings
- PrIME-LLM scores ranged from 0.64 (Gemini 1.5 Flash) to 0.78 (Grok 4), with reasoning-optimized models significantly outperforming non-reasoning models (0.76 vs 0.67, p<0.001)
- Differential diagnosis showed the poorest performance across all models with failure rates exceeding 0.80
- Final diagnosis demonstrated the highest accuracy with failure rates below 0.40
- The PrIME-LLM framework revealed critical reasoning gaps obscured by traditional accuracy metrics
- Multimodal improvements with image inputs were limited and inconsistent, with only 7 of 18 multimodal models showing significant gains
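The headline comparison above (reasoning-optimized models at 0.76 vs 0.67 for non-reasoning models, p&lt;0.001) is a standard two-group significance test on per-model scores. As an illustration only, the sketch below runs a two-sided permutation test on hypothetical per-model PrIME-LLM scores; the numbers are invented for demonstration and the study's actual statistical method may differ.

```python
import random
import statistics

def permutation_test(group_a, group_b, n_perm=10_000, seed=0):
    """Two-sided permutation test on the difference of group means.

    Repeatedly shuffles the pooled scores and counts how often a random
    split produces a mean difference at least as extreme as the observed one.
    """
    rng = random.Random(seed)
    observed = statistics.mean(group_a) - statistics.mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, extreme / n_perm

# Hypothetical per-model composite scores (illustrative, NOT the study's data)
reasoning_scores = [0.78, 0.77, 0.76, 0.75, 0.76, 0.74]
non_reasoning_scores = [0.68, 0.66, 0.67, 0.64, 0.69, 0.68]

diff, p_value = permutation_test(reasoning_scores, non_reasoning_scores)
```

A permutation test is a reasonable stand-in here because it makes no normality assumptions about the small per-model samples typical of benchmark comparisons.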
Why It Matters
These findings challenge optimism about LLM readiness for clinical deployment. While reasoning-optimized models show meaningful improvements, fundamental limitations in differential diagnosis—arguably the most critical clinical reasoning task—persist across model generations. The research indicates current LLMs should be restricted to supervised settings with low diagnostic uncertainty, not deployed as autonomous patient-facing tools.
Limitations
The evaluation used 29 standardized clinical vignettes from the MSD Manual, which may not fully capture the complexity of real-world clinical scenarios with nuanced histories and contradictory findings.
Original paper: Large Language Model Performance and Clinical Reasoning Tasks. JAMA Network Open. doi:10.1001/jamanetworkopen.2026.4003