Developing and Evaluating Large Language Model-Generated Emergency Medicine Handoff Notes.

Overview

abstract

IMPORTANCE: An emergency medicine (EM) handoff note generated by a large language model (LLM) has the potential to reduce physician documentation burden without compromising the safety of EM-to-inpatient (IP) handoffs.

OBJECTIVE: To develop LLM-generated EM-to-IP handoff notes and evaluate their accuracy and safety compared with physician-written notes.

DESIGN, SETTING, AND PARTICIPANTS: This cohort study used EM patient medical records with acute hospital admissions that occurred in 2023 at NewYork-Presbyterian/Weill Cornell Medical Center. A customized clinical LLM pipeline was trained, tested, and evaluated to generate templated EM-to-IP handoff notes. Using both conventional automated methods (ie, recall-oriented understudy for gisting evaluation [ROUGE], bidirectional encoder representations from transformers score [BERTScore], and source chunking approach for large-scale inconsistency evaluation [SCALE]) and a novel patient safety-focused framework, LLM-generated handoff notes vs physician-written notes were compared. Data were analyzed from October 2023 to March 2024.

EXPOSURE: LLM-generated EM handoff notes.

MAIN OUTCOMES AND MEASURES: LLM-generated handoff notes were evaluated for (1) lexical similarity with respect to physician-written notes using ROUGE and BERTScore; (2) fidelity with respect to source notes using SCALE; and (3) readability, completeness, curation, correctness, usefulness, and implications for patient safety using a novel framework.

RESULTS: In this study of 1600 EM patient records (832 [52%] female and mean [SD] age of 59.9 [18.9] years), LLM-generated handoff notes, compared with physician-written ones, had higher ROUGE (0.322 vs 0.088), BERTScore (0.859 vs 0.796), and SCALE scores (0.691 vs 0.456), indicating the LLM-generated summaries exhibited greater similarity and more detail. As reviewed by 3 board-certified EM physicians, a subsample of 50 LLM-generated summaries had a mean (SD) usefulness score of 4.04 (0.86) out of 5 (compared with 4.36 [0.71] for physician-written) and mean (SD) patient safety scores of 4.06 (0.86) out of 5 (compared with 4.50 [0.56] for physician-written). None of the LLM-generated summaries were classified as a critical patient safety risk.

CONCLUSIONS AND RELEVANCE: In this cohort study of 1600 EM patient medical records, LLM-generated EM-to-IP handoff notes were determined superior compared with physician-written summaries via conventional automated evaluation methods, but marginally inferior in usefulness and safety via a novel evaluation framework. This study suggests the importance of a physician-in-loop implementation design for this model and demonstrates an effective strategy to measure preimplementation patient safety of LLM models.

authors

Hartman, Vince
Zhang, Xinyuan
Poddar, Ritika
McCarty, Matthew
Fortenko, Alexander
Sholle, Evan
Sharma, Rahul
Campion, Thomas
Steel, Peter

publication date

December 2, 2024

published in

JAMA Network Open (Journal)

Research

keywords

Electronic Health Records
Emergency Medicine
Patient Handoff

Identity

PubMed Central ID

PMC11615705

Scopus Document Identifier

85211418758

Digital Object Identifier (DOI)

10.1001/jamanetworkopen.2024.48723

PubMed ID

39625719

Additional Document Info

has global citation frequency

28

volume

7

issue

12

VIVO Weill Cornell Medical College

Developing and Evaluating Large Language Model-Generated Emergency Medicine Handoff Notes. Academic Article
