A study of 691 FDA-cleared AI/ML devices reveals significant reporting gaps in efficacy, safety, and bias, calling for better regulation.

Original Title: Benefit-Risk Reporting for FDA-Cleared Artificial Intelligence-Enabled Medical Devices

Journal: JAMA Health Forum

DOI: 10.1001/jamahealthforum.2025.3351

FDA AI/ML Device Reporting Lacks Transparency

Overview

A comprehensive analysis of 691 artificial intelligence and machine learning (AI/ML) medical devices cleared by the US Food and Drug Administration (FDA) between 1995 and 2023 reveals significant deficiencies in benefit-risk reporting. The cross-sectional study examined FDA decision summaries and postmarket surveillance databases and found that crucial information was frequently missing: 95.5% of device summaries lacked demographic data for the populations on which the AI was tested, 53.3% did not report the training sample size, and 46.7% omitted the study design. The evidence supporting clearance was often not robust; only 1.6% of devices were backed by data from randomized clinical trials. Postmarket issues were also identified: 5.2% of devices were linked to a total of 489 adverse events, including one death, and 5.8% of devices were recalled.

Novelty

This research provides a uniquely comprehensive assessment by linking premarket clearance data with postmarket safety information from adverse event and recall databases for all FDA-cleared AI/ML devices over a 28-year period. While previous studies have examined aspects of AI/ML device approvals, this work is distinct in its scale and its integrated analysis of the full device lifecycle. The study also introduces a temporal analysis comparing devices cleared before 2021 with those cleared in or after 2021. This comparison showed that while reporting of demographic bias (19.3% vs 2.5%) and efficacy (35.8% vs 23.8%) has improved recently, reporting of safety assessments (13.4% vs 36.8%) and association with peer-reviewed publications (31.9% vs 43.7%) has declined.

My Perspective

I find that these results highlight a critical tension between the rapid advancement of AI technology and existing regulatory frameworks designed for more conventional medical devices. The FDA’s reliance on the 510(k) pathway, which clears most AI/ML devices based on “substantial equivalence” to a prior device, seems ill-suited for this technology. An AI algorithm is fundamentally different from a physical instrument; its performance is highly dependent on the data it was trained on. The finding that 95.5% of submissions lack demographic details is deeply concerning. This practice risks creating and perpetuating health disparities, as a tool validated on one population may not perform safely or effectively on another. It suggests a systemic failure to prioritize equity in the development and approval process for these influential clinical tools.

Potential Clinical / Research Applications

In clinical practice, these findings should prompt healthcare providers and institutions to exercise caution when adopting new AI/ML technologies. Clinicians should critically evaluate the evidence provided by manufacturers, specifically questioning the diversity of the training and validation data and demanding transparency on performance metrics before integrating these tools into patient care. For the research community, this study underscores the need to develop and advocate for standardized reporting guidelines for AI/ML device evaluations, analogous to frameworks used for clinical trials. Future research could also focus on creating robust, independent postmarket surveillance systems, perhaps by leveraging real-world data from electronic health records to monitor the performance of these devices after deployment and detect performance degradation or biases not apparent in premarket testing.
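As a concrete starting point for the kind of independent postmarket monitoring described above, adverse event reports on cleared devices can be pulled programmatically from the FDA's openFDA API, which exposes the MAUDE database. The sketch below only builds a query URL; the brand name is hypothetical, and the field names follow the openFDA device/event schema, which should be verified against the current documentation before use.

```python
from urllib.parse import urlencode

# openFDA endpoint for device adverse event reports (MAUDE)
OPENFDA_DEVICE_EVENT = "https://api.fda.gov/device/event.json"

def build_maude_query(brand_name: str, limit: int = 10) -> str:
    """Build an openFDA query URL for adverse event reports that
    mention a given device brand name.

    The `device.brand_name` field is part of the openFDA device/event
    schema; check the live API docs before relying on it.
    """
    params = {
        "search": f'device.brand_name:"{brand_name}"',
        "limit": limit,
    }
    return f"{OPENFDA_DEVICE_EVENT}?{urlencode(params)}"

# Hypothetical device name, purely for illustration.
url = build_maude_query("ExampleAI Triage", limit=5)
print(url)
```

Fetching that URL (e.g. with `requests.get`) returns JSON whose `results` array can be scanned for event types and dates, giving a lightweight signal-detection loop that complements the premarket summaries the study found lacking.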

Similar Posts

  • Role of stem-like cells in chemotherapy resistance and relapse in pediatric T-cell acute lymphoblastic leukemia

    Title Stem-like Cells Drive T-ALL Relapse One-Sentence Summary This study identifies a subpopulation of quiescent, stem-like leukemia cells that expands at relapse in pediatric T-cell acute lymphoblastic leukemia, linking their chemotherapy resistance to specific transcriptional and splicing programs. Overview Relapse in pediatric T-cell acute lymphoblastic leukemia (T-ALL) is associated with chemotherapy resistance and poor outcomes. To understand the underlying mechanisms, this research conducted longitudinal single-cell RNA sequencing on patient-derived samples collected at both diagnosis and relapse. The analysis included 13 patients who relapsed and 5 who did not. The study identified a distinct subpopulation of T-ALL cells with stem-like characteristics in 11 of the 18 patient samples. These cells, which…

  • Interpretable Survival Analysis for Alzheimer’s Progression

    Original Title: Basic Science and Pathogenesis Journal: Alzheimer's & dementia : the journal of the Alzheimer's Association DOI: 10.1002/alz70855_107083 Overview This research addresses the challenge of predicting the progression of Alzheimer’s disease and related dementias using survival analysis. While deep learning models offer high predictive performance, their complex architectures often obscure the biological factors driving their outputs. To resolve this, the authors introduce the Neural Additive Deep Clustering Survival Machines (NADCSM) framework. This model utilizes data from the Alzheimer’s Disease Neuroimaging Initiative, specifically focusing on AV45 Florbetapir PET imaging, genotyping, and demographic information to track the transition from mild cognitive impairment to early Alzheimer’s disease. The framework models survival times…

  • Non-coding genetic elements of lung cancer identified using whole genome sequencing in 13,722 Chinese

    Title Lung Cancer’s Non-Coding Genetic Drivers One-Sentence Summary A whole-genome sequencing study of 13,722 Chinese individuals identifies common and rare non-coding genetic variants associated with lung cancer, implicating novel genes and regulatory pathways. Overview This study investigated the genetic basis of lung cancer in the Chinese population, focusing on non-coding regions of the genome that regulate gene activity. Researchers performed whole-genome sequencing on 13,722 individuals and analyzed both common and rare genetic variants. For common variants, the analysis confirmed associations with known genes like TP63 and, through a transcriptome-wide association study (TWAS), linked the expression of eight genes to lung cancer risk. The analysis of rare variants, which are less…

  • Plexin-B2 in CTC Clustering and Breast Cancer Metastasis

    Original Title: Computational ranking identifies Plexin-B2 in circulating tumor cell clustering with monocytes in breast cancer metastasis Journal: Nature communications DOI: 10.1038/s41467-025-62862-z Overview Circulating tumor cell (CTC) clusters are significantly more effective at seeding metastases than single CTCs, but the molecular mechanisms driving their formation are not fully understood. This study employed a computational ranking system, integrating proteomic data from breast tumors and cell lines with clinical survival data, to identify key proteins involved in this process. The analysis pinpointed Plexin-B2 (PLXNB2) as a top candidate associated with poor patient outcomes. In clinical samples, high PLXNB2 expression was enriched in CTC clusters and correlated with unfavorable overall survival (Hazard Ratio…

  • AI-Driven Molecular Subtyping for Leiomyosarcoma Trials

    Original Title: Navigating the digital health landscape from artificial intelligence-driven molecular subtyping towards optimized rare sarcoma trial design Journal: International journal of surgery (London, England) DOI: 10.1097/JS9.0000000000003040 Overview This correspondence discusses a deep learning framework developed by He and colleagues for the molecular subtyping of leiomyosarcoma using histopathological images. The original study introduced the LMS_DL model, which analyzes single hematoxylin and eosin whole-slide images to predict molecular subtypes. This model achieved an area under the receiver operating characteristic curve (AUROC) of approximately 0.944. Furthermore, the researchers established a prognostic algorithm for predicting two-year overall survival, yielding an AUROC of approximately 0.937. The letter emphasizes how these technical achievements can be…

  • Evolution and Integration of Telerobotic Surgery Systems

Original Title: Telerobotic surgery: a comprehensive two-decade evolution and the integration of emerging technologies Journal: International journal of surgery (London, England) DOI: 10.1097/JS9.0000000000003484 Overview Telerobotic surgery has progressed from early conceptual stages to a sophisticated clinical reality over the last two decades. Initially envisioned by NASA in 1972 for orbital missions, the field achieved a major milestone in 2001 with the first transatlantic laparoscopic cholecystectomy, which operated over 14,000 kilometers with a transmission latency of 155 milliseconds. Since then, the technology has expanded across multiple medical disciplines. In urology, robotic systems were utilized in 42% of radical prostatectomies in the United States by 2006. More recently, the integration of high-speed…
