Demographic inaccuracies and biases in the depiction of patients by artificial intelligence text-to-image generators

AI’s Patient Images Show Demographic Biases

One-Sentence Summary

This study reveals that leading AI text-to-image generators produce patient depictions with significant demographic inaccuracies, over-representing White and normal-weight individuals while failing to reflect real-world disease epidemiology.

Overview

As artificial intelligence (AI) text-to-image generators become widely used for creating visual content, their application in medical contexts raises concerns about accuracy and bias. This research systematically evaluated four popular AI models (Adobe Firefly, Bing Image Generator, Meta Imagine, and Midjourney) to assess how accurately they depict patients for 29 different diseases. Researchers generated a total of 9,060 images and had 12 independent raters assess the depicted sex, age, race/ethnicity, and weight. These AI-generated demographics were then compared against established, real-world epidemiological data for each disease. The findings indicate a consistent failure across all platforms to accurately represent patient populations. A pronounced bias was observed toward the over-representation of White individuals, who constituted 87% of images from Adobe and 78% from Midjourney, compared to a pooled real-world average of 20%. Similarly, normal-weight individuals were over-represented, making up 96% of Adobe's and 93% of Midjourney's outputs, far exceeding the general population average of 63%.
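The core comparison behind these findings can be illustrated with a minimal sketch: measuring the percentage-point gap between the demographic shares in generated images and the real-world baselines. The figures below are the ones quoted in this summary; the aggregation is illustrative and is not necessarily the statistical method used in the paper.

```python
# Illustrative sketch: comparing demographic proportions in AI-generated
# images against pooled real-world baselines. Percentages are those
# quoted in the overview above; the analysis shown is an assumption,
# not the paper's actual methodology.

REAL_WORLD = {"white": 0.20, "normal_weight": 0.63}  # pooled baselines

GENERATED = {
    "Adobe Firefly": {"white": 0.87, "normal_weight": 0.96},
    "Midjourney":    {"white": 0.78, "normal_weight": 0.93},
}

def over_representation(model_props, baseline):
    """Percentage-point gap between generated and real-world shares."""
    return {attr: round((model_props[attr] - baseline[attr]) * 100, 1)
            for attr in baseline}

for model, props in GENERATED.items():
    print(model, over_representation(props, REAL_WORLD))
```

Run on the quoted numbers, this yields a 67-point over-representation of White individuals for Adobe Firefly and a 58-point gap for Midjourney, which is the scale of inaccuracy the study documents.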

Novelty

While previous studies have identified general demographic biases in AI image generators, this paper provides a specific and systematic analysis within the medical domain. Its novelty lies in directly comparing the outputs of multiple leading AI models against concrete, real-world epidemiological data for a wide range of diseases. The study moves beyond general observations of bias by quantifying the inaccuracies in depictions of disease-specific populations. For instance, it evaluates whether the AI correctly generates images of children for pediatric diseases or males for male-specific conditions. This rigorous, disease-contextualized approach provides specific evidence of the models’ shortcomings in a field where accuracy is critical, highlighting a significant gap between the technology’s capabilities and the requirements for responsible medical use.

My Perspective

The inaccuracies documented in this paper likely stem not only from biased training data but also from the AI developers’ attempts to mitigate bias, which can lead to unintended consequences. For example, the depiction of both men and women for sex-specific diseases like prostate cancer suggests an over-correction, where a general directive to ensure gender balance overrides specific, context-dependent facts. This reveals a lack of nuanced understanding within the models. Furthermore, the perpetuation of these biases in medical illustrations or educational materials is particularly concerning. It could inadvertently reinforce stereotypes among healthcare students and professionals, potentially leading them to associate certain diseases with specific demographics. This could subtly influence clinical judgment and contribute to diagnostic delays for patients who do not fit the stereotypical image presented by these tools.

Potential Clinical / Research Applications

Clinically, these findings serve as a strong caution against the uncritical use of AI-generated images for patient education or medical training. Healthcare professionals and educators who use these tools must be aware of their current limitations and should manually curate or edit images to ensure they reflect accurate patient diversity. For research, this study opens several important avenues. It highlights the need for developing and fine-tuning AI models on more diverse and medically relevant datasets that include a wider range of patient demographics. Future research could explore the effectiveness of advanced “prompt engineering”—using highly detailed text commands to specify patient characteristics—in reducing these biases. Additionally, this work can inform the development of standards and guidelines for the use of generative AI in healthcare, pushing developers to prioritize demographic accuracy and transparency in their models.
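One way the prompt-engineering idea above could work in practice is to sample patient attributes from disease-specific epidemiology and state them explicitly in the prompt, rather than leaving demographics to the model's defaults. The sketch below is hypothetical: the distribution figures and prompt template are illustrations, not values from the study.

```python
# Hypothetical sketch of demographically-grounded prompt engineering:
# sample patient attributes from an epidemiological distribution and
# inject them into the image prompt. The prevalence weights and the
# prompt wording are illustrative assumptions.
import random

# Illustrative distribution; a real pipeline would load these figures
# from epidemiological data for the disease in question.
DEMOGRAPHICS = {
    "sex": [("male", 1.0)],  # e.g. prostate cancer: male-specific
    "age_group": [("middle-aged", 0.35), ("elderly", 0.65)],
}

def sample_attribute(options, rng):
    """Draw one value from a list of (value, weight) pairs."""
    values, weights = zip(*options)
    return rng.choices(values, weights=weights, k=1)[0]

def build_prompt(disease, rng=None):
    """Compose an image prompt with explicit patient demographics."""
    rng = rng or random.Random()
    sex = sample_attribute(DEMOGRAPHICS["sex"], rng)
    age = sample_attribute(DEMOGRAPHICS["age_group"], rng)
    return f"photo of a {sex} patient, {age}, with {disease}, clinical setting"

print(build_prompt("prostate cancer", random.Random(0)))
```

Whether such explicit prompting reliably overrides the models' built-in demographic defaults is exactly the open question this line of future research would need to test.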

