Custom AI Models Beat Generic ChatGPT for Medical Radiology Reports

A new study shows that radiology AI models fine-tuned on institutional data outperform general-purpose language models like GPT-4.1 in clinical acceptance, challenging assumptions about off-the-shelf AI solutions in healthcare.

Background

As artificial intelligence increasingly enters clinical practice, radiologists face a choice: adopt general-purpose language models or implement custom AI systems trained on institutional data. This study compared both approaches by having 10 clinicians—including radiologists and oncologists—evaluate AI-generated impressions against human-authored reports in 200 oncologic CT cases from an academic cancer center.

Key Findings

  • A custom domain-specific model achieved near parity with human impressions, with the original radiologists showing only a marginal, non-significant preference for their own work (p=0.0716)
  • GPT-4.1 impressions averaged 75 words versus 41 for human-authored reports, a significant loss of conciseness (p<0.001)
  • Radiologists strongly disfavored the generic model’s impressions (Cohen’s h=1.04–1.22, p<0.001; see the effect-size sketch after this list), while oncologists showed no significant preference
  • Patient harm risk remained uniformly low across all impression types, suggesting safety was not compromised by any approach
  • Inter-rater reliability was low to moderate (α=−0.09 to 0.67), indicating that impression-quality assessment is substantially subjective across evaluators
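
For context on the effect size reported above: Cohen’s h is the absolute difference between two proportions after an arcsine transform, with roughly 0.2, 0.5, and 0.8 conventionally read as small, medium, and large effects. The minimal sketch below uses hypothetical preference rates, not figures from the paper, since the underlying proportions are not reported here.

```python
import math

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: absolute difference of two proportions on the arcsine scale."""
    return abs(2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2)))

# Hypothetical rates (not from the study): evaluators preferring the human
# impression 85% of the time in one comparison and 35% in another.
print(round(cohens_h(0.85, 0.35), 2))  # -> 1.08, a large effect by convention
```

By these conventions, the h values above 1 reported for the radiologist ratings reflect a very pronounced preference against the generic model.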

Why It Matters

These results suggest domain-specific fine-tuning is essential for clinical AI acceptance. Rather than replacing radiologists, AI impressions work best as flexible drafting tools that reduce cognitive burden while preserving clinician oversight. Implementation should match stakeholder preferences rather than assume a single objective standard for quality.

Limitations

The study involved a single institutional dataset and focused on oncologic CT. Generalizability to other imaging modalities and healthcare settings remains unclear. Additionally, the inherent subjectivity in ratings limits consensus on what constitutes an optimal impression.

Original paper: Comparison of AI-generated radiology impressions: a multi-stakeholder evaluation. npj Digital Medicine. DOI: 10.1038/s41746-026-02586-6
