Generative Large Language Models Trained for Detecting Errors in Radiology Reports. Academic Article

Overview

abstract

  • Background: Large language models (LLMs) offer promising solutions, yet their application in medical proofreading, particularly in detecting errors within radiology reports, remains underexplored.

    Purpose: To develop and evaluate generative LLMs for detecting errors in radiology reports during medical proofreading.

    Materials and Methods: In this retrospective study, a dataset was constructed with two parts. The first part included 1656 synthetic chest radiology reports generated by GPT-4 (OpenAI) using specified prompts: 828 error-free synthetic reports and 828 containing errors. The second part included 614 reports: 307 error-free reports from 2011 to 2016 from the MIMIC chest radiograph (MIMIC-CXR) database and 307 corresponding synthetic reports with errors generated by GPT-4 on the basis of these MIMIC-CXR reports and specified prompts. All errors were categorized into four types: negation, left/right, interval change, and transcription errors. Several models, including Llama-3 (Meta AI), GPT-4, and BiomedBERT, were then refined using zero-shot prompting, few-shot prompting, or fine-tuning strategies. Finally, model performance was evaluated on the constructed dataset using F1 scores, 95% CIs, and paired-sample t tests, with the prediction results further assessed by radiologists.

    Results: Using zero-shot prompting, the fine-tuned Llama-3-70B-Instruct model achieved the best performance, with the following F1 scores: 0.769 (95% CI: 0.757, 0.771) for negation errors, 0.772 (95% CI: 0.762, 0.780) for left/right errors, 0.750 (95% CI: 0.736, 0.763) for interval change errors, 0.828 (95% CI: 0.822, 0.832) for transcription errors, and 0.780 overall. In the real-world evaluation phase, two radiologists reviewed 200 randomly selected reports output by the model (50 for each error type). Of these, 99 were confirmed by both radiologists to contain errors detected by the models, and 163 were confirmed by at least one radiologist to contain model-detected errors.

    Conclusion: Generative LLMs, fine-tuned on synthetic and MIMIC-CXR radiology reports, greatly enhanced error detection in radiology reports.

    © RSNA, 2025. Supplemental material is available for this article. See also the editorial by Marrocchio and Sverzellati in this issue.
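
The dataset construction step described above (injecting a single error of a chosen type into an otherwise error-free report via GPT-4) can be illustrated with a minimal sketch. This assumes the OpenAI Python client; the prompt wording and the inject_error helper are hypothetical, since the abstract refers only to "specified prompts" without reproducing them.

# Hedged sketch of the synthetic-error generation step: GPT-4 is asked to
# inject one error of a chosen type into a clean chest radiology report.
# The prompt text below is an illustrative assumption, not the study's prompt.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

ERROR_TYPES = ("negation", "left/right", "interval change", "transcription")

def inject_error(report: str, error_type: str) -> str:
    """Return a copy of `report` containing one synthetic error of `error_type`."""
    assert error_type in ERROR_TYPES
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "You edit chest radiology reports for a proofreading benchmark."},
            {"role": "user",
             "content": (f"Introduce exactly one {error_type} error into the "
                         f"following report and return only the edited report:\n\n{report}")},
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content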
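
The zero-shot prompting strategy could look roughly as follows, assuming the publicly released Meta-Llama-3-70B-Instruct checkpoint on Hugging Face. The system prompt, the label formatting, and the detect_errors helper are illustrative assumptions rather than the study's actual prompts, and the 70B checkpoint needs multi-GPU hardware or quantization to run.

# Minimal zero-shot error-detection sketch with Hugging Face transformers.
# The four error labels mirror the abstract; the prompt wording is assumed.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Meta-Llama-3-70B-Instruct"  # large; requires accelerate

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)

ERROR_TYPES = ["negation", "left/right", "interval change", "transcription"]

def detect_errors(report_text: str) -> str:
    """Ask the model to proofread a single radiology report, zero-shot."""
    messages = [
        {"role": "system",
         "content": ("You are a radiology proofreading assistant. Check the "
                     "report for these error types: " + ", ".join(ERROR_TYPES) +
                     ". Reply with 'error-free' or list each error found.")},
        {"role": "user", "content": report_text},
    ]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=256, do_sample=False)
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)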
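
For the evaluation, per-error-type F1 scores with 95% CIs can be computed along these lines. The abstract does not state the interval procedure, so the percentile bootstrap below is an assumption made for illustration.

# Hedged sketch: point-estimate F1 plus a percentile bootstrap 95% CI.
import numpy as np
from sklearn.metrics import f1_score

def f1_with_ci(y_true, y_pred, n_boot=1000, seed=0):
    """F1 with a 95% CI from resampling prediction pairs with replacement."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    point = f1_score(y_true, y_pred, zero_division=0)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))  # bootstrap resample
        scores.append(f1_score(y_true[idx], y_pred[idx], zero_division=0))
    lo, hi = np.percentile(scores, [2.5, 97.5])
    return point, (lo, hi)

# Toy example: binary labels where 1 = "report contains this error type".
f1, (lo, hi) = f1_with_ci([1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 0, 1])
print(f"F1 = {f1:.3f} (95% CI: {lo:.3f}, {hi:.3f})")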

publication date

  • May 1, 2025

Research

keywords

  • Diagnostic Errors
  • Radiology Information Systems

Identity

Digital Object Identifier (DOI)

  • 10.1148/radiol.242575

PubMed ID

  • 40392090

Additional Document Info

volume

  • 315

issue

  • 2