LLMs Prove Competitive for Clinical Prediction: New Study Challenges Conventional Wisdom

Large language models are now competitive—and often superior—to traditional machine learning for clinical prediction tasks, according to a comprehensive benchmark study.

Background

The clinical AI field has traditionally emphasized specialized models, assuming LLMs are ill-suited for prediction. ClinicRealm challenges this assumption by comparing 15 GPT-style LLMs, 5 BERT-based models, and 11 conventional ML/DL models on mortality, readmission, and length-of-stay prediction tasks, using clinical notes and structured EHR data from the MIMIC and TJH datasets.

Key Findings

  • State-of-the-art LLMs used zero-shot (DeepSeek-R1: 90.75% AUROC; GPT-5: 89.75% AUROC) substantially outperform fine-tuned BERT models (87.97% AUROC) on mortality prediction from clinical notes
  • Advanced LLMs show strong zero-shot capabilities and exceed conventional models in low-data (10-shot) settings
  • Open-source LLMs match or surpass proprietary models, democratizing access to high-performing clinical tools
  • Multimodal integration doesn’t uniformly improve performance; clinical notes often provide the dominant predictive signal
  • Zero-shot LLMs demonstrate greater fairness across demographic attributes compared to trained models
  • Human evaluation reveals high-quality reasoning but specific failure modes: hallucination-driven false positives and flawed clinical reasoning causing false negatives
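The headline metric above, AUROC (area under the ROC curve), is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. A self-contained sketch of the rank-based computation (the `auroc` helper is illustrative, not code from the study):

```python
def auroc(labels, scores):
    """Rank-based AUROC: the Mann-Whitney U statistic divided by n_pos * n_neg."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):                       # assign average ranks to ties
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1   # 1-based average rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

In these terms, a 90.75% AUROC means a randomly drawn patient who died is ranked above a randomly drawn survivor about 91% of the time.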

Why It Matters

Healthcare institutions should reconsider model selection strategies. Modern LLMs are now practical alternatives for clinical prediction, particularly for unstructured data and zero-shot applications. The strong performance of open-source models has important implications for resource-constrained systems. Specialized models remain optimal for structured EHR data when training data is ample, suggesting a complementary relationship rather than replacement.

Limitations

The study's computational design provides valuable benchmarking data but may not fully capture real-world deployment challenges, including regulatory compliance and clinical workflow integration. Reliance on specific datasets (MIMIC-III/IV, TJH) may also limit generalizability across diverse healthcare systems and populations.

Original paper: ClinicRealm: Re-evaluating large language models with conventional machine learning for non-generative clinical prediction tasks. npj Digital Medicine. DOI: 10.1038/s41746-026-02539-z