Generative Large Language Models Trained for Detecting Errors in Radiology Reports. Academic Article

Overview

abstract

  • Background: Large language models (LLMs) offer promising solutions, yet their application in medical proofreading, particularly in detecting errors within radiology reports, remains underexplored.

    Purpose: To develop and evaluate generative LLMs for detecting errors in radiology reports during medical proofreading.

    Materials and Methods: In this retrospective study, a dataset was constructed with two parts. The first part included 1656 synthetic chest radiology reports generated by GPT-4 (OpenAI) using specified prompts: 828 error-free synthetic reports and 828 containing errors. The second part included 614 reports: 307 error-free reports from 2011 to 2016 from the MIMIC chest radiograph (MIMIC-CXR) database and 307 corresponding synthetic reports with errors generated by GPT-4 on the basis of these MIMIC-CXR reports and specified prompts. All errors were categorized into four types: negation, left/right, interval change, and transcription errors. Several models, including Llama-3 (Meta AI), GPT-4, and BiomedBERT, were then refined using zero-shot prompting, few-shot prompting, or fine-tuning strategies. Finally, model performance was evaluated on the constructed dataset using F1 scores, 95% CIs, and paired-sample t tests, with the prediction results further assessed by radiologists.

    Results: Using zero-shot prompting, the fine-tuned Llama-3-70B-Instruct model achieved the best performance, with the following F1 scores: 0.769 (95% CI: 0.757, 0.771) for negation errors, 0.772 (95% CI: 0.762, 0.780) for left/right errors, 0.750 (95% CI: 0.736, 0.763) for interval change errors, 0.828 (95% CI: 0.822, 0.832) for transcription errors, and 0.780 overall. In the real-world evaluation phase, two radiologists reviewed 200 randomly selected reports output by the model (50 for each error type). Of these, 99 were confirmed by both radiologists to contain errors detected by the models, and 163 were confirmed by at least one radiologist to contain model-detected errors.

    Conclusion: Generative LLMs, fine-tuned on synthetic and MIMIC-CXR radiology reports, greatly enhanced error detection in radiology reports.

    © RSNA, 2025. Supplemental material is available for this article. See also the editorial by Marrocchio and Sverzellati in this issue.
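
The dataset construction step described above (injecting a single error of a chosen type into an otherwise error-free report via GPT-4) can be illustrated with a minimal sketch. This assumes the OpenAI Python client; the prompt wording and the inject_error helper are hypothetical, since the abstract refers only to "specified prompts" without reproducing them.

# Hedged sketch of the synthetic-error generation step: GPT-4 is asked to
# inject one error of a chosen type into a clean chest radiology report.
# The prompt text below is an illustrative assumption, not the study's prompt.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

ERROR_TYPES = ("negation", "left/right", "interval change", "transcription")

def inject_error(report: str, error_type: str) -> str:
    """Return a copy of `report` containing one synthetic error of `error_type`."""
    assert error_type in ERROR_TYPES
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "You edit chest radiology reports for a proofreading benchmark."},
            {"role": "user",
             "content": (f"Introduce exactly one {error_type} error into the "
                         f"following report and return only the edited report:\n\n{report}")},
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content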
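
The zero-shot prompting strategy could look roughly as follows, assuming the publicly released Meta-Llama-3-70B-Instruct checkpoint on Hugging Face. The system prompt, the label formatting, and the detect_errors helper are illustrative assumptions rather than the study's actual prompts, and the 70B checkpoint needs multi-GPU hardware or quantization to run.

# Minimal zero-shot error-detection sketch with Hugging Face transformers.
# The four error labels mirror the abstract; the prompt wording is assumed.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Meta-Llama-3-70B-Instruct"  # large; requires accelerate

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)

ERROR_TYPES = ["negation", "left/right", "interval change", "transcription"]

def detect_errors(report_text: str) -> str:
    """Ask the model to proofread a single radiology report, zero-shot."""
    messages = [
        {"role": "system",
         "content": ("You are a radiology proofreading assistant. Check the "
                     "report for these error types: " + ", ".join(ERROR_TYPES) +
                     ". Reply with 'error-free' or list each error found.")},
        {"role": "user", "content": report_text},
    ]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=256, do_sample=False)
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)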
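
For the evaluation, per-error-type F1 scores with 95% CIs can be computed along these lines. The abstract does not state the interval procedure, so the percentile bootstrap below is an assumption made for illustration.

# Hedged sketch: point-estimate F1 plus a percentile bootstrap 95% CI.
import numpy as np
from sklearn.metrics import f1_score

def f1_with_ci(y_true, y_pred, n_boot=1000, seed=0):
    """F1 with a 95% CI from resampling prediction pairs with replacement."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    point = f1_score(y_true, y_pred, zero_division=0)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))  # bootstrap resample
        scores.append(f1_score(y_true[idx], y_pred[idx], zero_division=0))
    lo, hi = np.percentile(scores, [2.5, 97.5])
    return point, (lo, hi)

# Toy example: binary labels where 1 = "report contains this error type".
f1, (lo, hi) = f1_with_ci([1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 0, 1])
print(f"F1 = {f1:.3f} (95% CI: {lo:.3f}, {hi:.3f})")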

publication date

  • May 1, 2025

Research

keywords

  • Diagnostic Errors
  • Radiology Information Systems

Identity

Digital Object Identifier (DOI)

  • 10.1148/radiol.242575

PubMed ID

  • 40392090

Additional Document Info

volume

  • 315

issue

  • 2