LLMs for De-identifying Sensitive Health Information

Original Title: Leveraging large language models for the deidentification and temporal normalization of sensitive health information in electronic health records

Journal: npj Digital Medicine

DOI: 10.1038/s41746-025-01921-7

Overview

Sharing electronic health records (EHRs) for research is vital, but it requires removing sensitive health information (SHI) to protect patient privacy. This process, known as de-identification, also involves temporal normalization, which standardizes date and time expressions so that a coherent patient timeline is preserved. The paper evaluates how effectively large language models (LLMs) handle these two tasks, presenting a detailed analysis of the SREDH/AI CUP 2023 competition, which challenged 291 teams to build systems for SHI recognition and temporal normalization on a dataset of 3,244 pathology reports. The study systematically compares in-context learning, full-parameter fine-tuning, and parameter-efficient fine-tuning across a range of LLM sizes to establish performance baselines and identify effective strategies.
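To make the two tasks concrete, the toy sketch below annotates a made-up report sentence with SHI spans and normalizes its date mention to ISO 8601. The report text, label names, and the `normalize_date` helper are illustrative assumptions, not the competition's actual annotation schema or the paper's pipeline.

```python
from datetime import datetime

# Illustrative sketch only: made-up report text and labels.
report = "Specimen received 5 Mar 2019 from Dr. Smith at County General."

# Task 1: SHI recognition -- locate spans of sensitive health information.
# In the paper these spans are predicted by an LLM; here they are hard-coded.
shi_spans = [
    {"category": "DATE", "text": "5 Mar 2019"},
    {"category": "DOCTOR", "text": "Dr. Smith"},
    {"category": "HOSPITAL", "text": "County General"},
]

# Task 2: temporal normalization -- map date/time mentions to a standard
# form (ISO 8601 here) so the patient timeline stays coherent after
# de-identification.
def normalize_date(text):
    for fmt in ("%d %b %Y", "%B %d, %Y", "%Y/%m/%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None  # leave unrecognized expressions for the model or rules

for span in shi_spans:
    if span["category"] == "DATE":
        span["normalized"] = normalize_date(span["text"])

print(shi_spans[0])
# {'category': 'DATE', 'text': '5 Mar 2019', 'normalized': '2019-03-05'}
```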

Novelty

The study provides a comprehensive performance analysis of the Pythia suite of LLMs, ranging from 70 million to 12 billion parameters, on de-identification tasks. A key finding is an "inverse scaling issue": when fine-tuned, models larger than 6 billion parameters plateau or even degrade, likely because they overfit the moderately sized dataset, and optimal performance is reached with a 2.8-billion-parameter model. The research systematically compares training strategies and shows that parameter-efficient fine-tuning with LoRA can outperform traditional full-parameter fine-tuning, especially for larger models. Analysis of the competition results, in which 77.2% of teams used LLMs, reveals that the most successful systems were often hybrids that combined LLMs for contextual understanding with pattern-based rules for precision, particularly on structured information. The top-performing system for SHI recognition achieved a macro-F1 score of 0.881, while the best temporal normalization system scored 0.869.
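As a rough illustration of what parameter-efficient fine-tuning looks like in practice, the sketch below attaches LoRA adapters to a Pythia-2.8b checkpoint using the Hugging Face `transformers` and `peft` libraries. The hyperparameters and the causal-LM prompt framing are assumptions for illustration, not the paper's exact recipe.

```python
# Minimal LoRA fine-tuning sketch for a Pythia checkpoint, assuming the
# Hugging Face transformers + peft libraries are installed.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "EleutherAI/pythia-2.8b"  # the size at which performance peaked
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

lora_config = LoraConfig(
    r=16,                                # adapter rank (assumed value)
    lora_alpha=32,                       # scaling factor (assumed value)
    lora_dropout=0.05,
    target_modules=["query_key_value"],  # GPT-NeoX attention projection in Pythia
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small LoRA adapters are updated
# Training then proceeds as usual (e.g., with transformers.Trainer) on prompts
# that ask the model to emit SHI spans and normalized date expressions.
```

Because only a small fraction of parameters is trained, an approach like this is also less prone to the overfitting that the paper associates with full fine-tuning of the larger models on a moderately sized dataset.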

Potential Clinical / Research Applications

The methods evaluated in this paper can directly facilitate the creation of large, safe, and high-quality datasets from clinical notes for research purposes. Automated and reliable de-identification tools can accelerate studies on disease patterns, treatment outcomes, and social determinants of health by making more data accessible without compromising patient confidentiality. Clinically, robust temporal normalization is critical for accurately reconstructing patient histories from unstructured text. This capability can enhance clinical decision support systems by providing a clear timeline of events, which is fundamental for diagnosis and treatment planning. The study's findings also provide practical guidance for healthcare institutions on selecting cost-effective AI tools, showing that smaller, fine-tuned models can outperform larger, more resource-intensive ones for this specific task. This research paves the way for developing more advanced, trustworthy AI pipelines for processing sensitive medical information.

