Large Language Model Use Cases in Health Care Research Are Redundant and Often Lack Appropriate Methodological Conduct: A Scoping Review and Call for Improved Practices.
Review
Abstract
PURPOSE: To describe current use cases of large language models (LLMs) in musculoskeletal medicine and to evaluate the methodological conduct of these investigations, in order to safeguard future implementation of LLMs in clinical research and identify key areas for methodological improvement.

METHODS: A comprehensive literature search was performed in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines using the PubMed, Cochrane Library, and Embase databases to identify eligible studies. Included studies evaluated the use of LLMs within any realm of orthopaedic surgery, regardless of application in a clinical or educational setting. The Methodological Index for Non-Randomized Studies (MINORS) criteria were used to assess the quality of all included studies.

RESULTS: In total, 114 studies published from 2022 to 2024 were identified. Extensive use case redundancy was observed, and 5 main categories of clinical applications of LLMs were identified: 48 studies (42.1%) assessed the ability to answer patient questions, 24 (21.1%) evaluated the ability to diagnose and manage medical conditions, 21 (18.4%) evaluated the ability to take orthopaedic examinations, 11 (9.6%) analyzed the ability to develop or evaluate patient educational materials, and 10 (8.8%) concerned other applications, such as generating images, generating discharge documents and clinical letters, writing scientific abstracts and manuscripts, and enhancing billing efficiency. General orthopaedics was the focus of most included studies (n = 39, 34.2%), followed by orthopaedic sports medicine (n = 18, 15.8%) and adult reconstructive surgery (n = 17, 14.9%). ChatGPT 3.5 was the most commonly used or evaluated LLM (n = 79, 69.2%), followed by ChatGPT 4.0 (n = 47, 41.2%). Methodological inconsistency was prevalent: 36 studies (31.6%) failed to disclose the exact prompts used, 64 (56.1%) failed to disclose the exact outputs generated by the LLM, and only 7 (6.1%) evaluated different prompting strategies to elicit desired outputs. No studies investigated whether race or gender influenced model outputs.

CONCLUSIONS: Among studies evaluating LLM health care use cases, the scope of clinical investigation was limited, with most studies examining redundant use cases. Methodological inconsistency was concerningly extensive, driven by infrequently reported prompting strategies, incomplete model specifications, failure to disclose exact model outputs, and limited attempts to address bias.

CLINICAL RELEVANCE: A comprehensive understanding of current LLM use cases is critical to familiarize providers with the ways this technology may be used in clinical practice. As LLM health care applications transition from research to clinical integration, model transparency and trustworthiness are critical. The results of the current study suggest that guidance is urgently needed, focused on promoting appropriate methodological conduct and novel use cases to advance the field.