Automating Expert-Level Medical Reasoning Evaluation for AI

Original Title: Automating expert-level medical reasoning evaluation of large language models

Journal: npj Digital Medicine

DOI: 10.1038/s41746-025-02208-7

Overview

Large language models increasingly assist in clinical decision-making, yet their internal reasoning processes often remain opaque. Current evaluation methods frequently rely on multiple-choice accuracy, which fails to capture whether a model reached a correct conclusion through sound medical logic or mere pattern matching. Human expert review provides a highly reliable assessment, but it is time-consuming and difficult to scale. To address these limitations, the researchers developed MedThink-Bench, a dataset of 500 complex medical questions spanning ten domains, including pathology and pharmacology, with each question paired with expert-authored, step-by-step reasoning paths. Alongside this benchmark, the study introduced LLM-w-Rationale, an automated evaluation framework that uses a secondary language model as a judge to compare a target model’s reasoning against the expert-provided steps. Experimental results indicate that LLM-w-Rationale agrees closely with human expert grading, achieving a Pearson correlation coefficient of 0.87. Moreover, the automated process evaluated the entire dataset in only 51.8 minutes, about 1.4% of the 3708.3 minutes required for manual human assessment.
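
To make the evaluation flow concrete, here is a minimal sketch of how such a benchmark-plus-judge loop can be structured. The `BenchmarkItem` class, the `evaluate` function, and the pluggable `score_fn` are illustrative assumptions for this summary, not the paper’s released code.

```python
# Minimal sketch: each benchmark item pairs a question and its expert
# reasoning steps with the rationale produced by the model under test,
# and a pluggable scoring function grades each item on a 0-1 scale.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class BenchmarkItem:
    question: str
    expert_steps: List[str]   # expert-authored reasoning path, one step per entry
    model_rationale: str      # free-text reasoning produced by the model under test


def evaluate(dataset: List[BenchmarkItem],
             score_fn: Callable[[BenchmarkItem], float]) -> float:
    """Average per-item reasoning score (each score in [0, 1]) over the benchmark."""
    if not dataset:
        return 0.0
    return sum(score_fn(item) for item in dataset) / len(dataset)
```

In the study, scores produced by the automated judge were compared against grades from human experts on the same items; the Pearson correlation of 0.87 reported above comes from that comparison.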

Novelty

The primary contribution of this work is a high-quality, expert-curated benchmark that prioritizes the logic behind a medical answer over the final selection. Unlike previous datasets that often used AI-generated rationales as ground truth, MedThink-Bench provides reasoning trajectories manually verified by a team of ten medical professionals. The LLM-w-Rationale framework introduces a one-to-many comparison logic: instead of requiring a model to match an expert’s phrasing exactly, the judge model determines whether the model’s overall explanation adequately supports each discrete expert reasoning step. This allows flexibility in wording while maintaining strict logical standards. A notable finding enabled by this benchmark is the divergence between prediction accuracy and reasoning quality. For instance, MedGemma-27B achieved an expert reasoning score of 0.759, whereas OpenAI-o3, despite a higher multiple-choice accuracy of 0.692, scored only 0.384 on reasoning quality. This suggests that many models produce correct answers through flawed logic.
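
The one-to-many check can be sketched as follows: the judge is queried once per expert step, and the item score is the fraction of steps the model’s explanation supports. The `call_judge_llm(prompt) -> str` helper is a hypothetical stand-in for whatever judge model API is used, and the prompt wording and aggregation rule are assumptions for illustration rather than the authors’ exact prompts.

```python
from typing import Callable, List


def step_supported(expert_step: str,
                   model_rationale: str,
                   call_judge_llm: Callable[[str], str]) -> bool:
    """Judge whether the model's full explanation supports one expert step."""
    prompt = (
        f"Expert reasoning step:\n{expert_step}\n\n"
        f"Model explanation:\n{model_rationale}\n\n"
        "Does the model explanation adequately support this step? "
        "Answer YES or NO."
    )
    return call_judge_llm(prompt).strip().upper().startswith("YES")


def reasoning_score(expert_steps: List[str],
                    model_rationale: str,
                    call_judge_llm: Callable[[str], str]) -> float:
    """Fraction of expert steps the explanation supports (0 to 1)."""
    if not expert_steps:
        return 0.0
    hits = sum(step_supported(step, model_rationale, call_judge_llm)
               for step in expert_steps)
    return hits / len(expert_steps)
```

A per-step score of this form gives partial credit when an explanation covers only some of the expert logic, and it plugs directly into the `evaluate` loop sketched earlier, e.g. `evaluate(dataset, lambda item: reasoning_score(item.expert_steps, item.model_rationale, call_judge_llm))`. How the authors aggregate step-level judgments in detail is not specified in this summary, so the simple fraction used here is an assumption.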

Potential Clinical / Research Applications

This framework offers a practical method for rapidly benchmarking new medical AI models before they are deployed in clinical settings. Researchers can use it to identify domains where a model is prone to hallucination or logical failure, such as complex diagnostic workups or pharmacology. In clinical environments, the system could serve as an automated auditing tool, flagging instances where an AI assistant issues a recommendation without sufficient logical justification. It also has potential as an educational resource in medical training: by comparing student reasoning paths against the expert trajectories in MedThink-Bench, the system could provide objective feedback on clinical logic. At approximately $0.80 per evaluation run using commercial APIs, the LLM-w-Rationale framework is efficient and inexpensive enough for smaller research institutions and hospitals to validate their internal AI tools.

