Evaluating Large Language Models for Decision Support in Minimally Invasive Spine Surgery Triage and Procedural Categories. Academic Article

Overview

abstract

  • Study Design: Vignette-based cross-sectional study.
    Objective: Generative artificial intelligence (AI) programs such as large language models (LLMs) are reshaping treatment decision-making, yet applications in minimally invasive spine surgery (MISS) are still scarce. This study examines whether OpenAI's ChatGPT-5 Pro and Google's Gemini 2.5 Pro reproduce expert management categories from published MISS cases and measures agreement at the procedural and binary triage levels.
    Methods: We constructed 90 clinical vignettes from published case reports and prompted each LLM to assign 1 or more of ten predefined categories with two-sentence rationales. Agreement with the reference standard was assessed using Jensen-Shannon divergence (JSD), Stuart-Maxwell tests, Cohen's κ, and McNemar's test for surgical vs non-surgical triage.
    Results: Divergence from the reference was small, with JSD of 0.115 (ChatGPT-5 Pro) and 0.112 (Gemini 2.5 Pro), and smaller between models at 0.073. Paired multinomial tests found differences from the reference (Stuart-Maxwell χ²(9) = 24.8 and 26.0; P = 0.007 and 0.006) but not between models (14.4; P = 0.108). Case-level agreement was slight for ChatGPT-5 Pro and fair for Gemini 2.5 Pro (κ = 0.146 and 0.245). Collapsing categories to surgical vs non-surgical improved agreement (κ = 0.415 and 0.587 vs the reference; 0.692 between models) with no bias in rates (P ≥ 0.401).
    Conclusions: LLMs may differentiate between surgical and non-surgical triage, but procedure selection should remain expert-led until these systems mature. These findings establish a baseline for integrating LLMs into surgical triage workflows and highlight both the promise and limitations of generative AI in precision spine care.
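
    For readers unfamiliar with the agreement statistics named in the Methods, the sketch below illustrates how such metrics could be computed in Python with scipy, scikit-learn, and statsmodels. It is an illustration only, not the study's analysis code: the case labels are random placeholders, categories are treated as single-label rather than multi-label, the logarithm base for the JSD and the mapping of category codes to "surgical" are assumptions.

    # Illustrative sketch (hypothetical data, not the authors' pipeline):
    # agreement metrics between a reference standard and one model's labels.
    import numpy as np
    from scipy.spatial.distance import jensenshannon
    from sklearn.metrics import cohen_kappa_score
    from statsmodels.stats.contingency_tables import SquareTable, mcnemar

    # Hypothetical labels for 90 vignettes, coded 0-9 for ten management categories.
    rng = np.random.default_rng(0)
    reference = rng.integers(0, 10, size=90)
    model = rng.integers(0, 10, size=90)

    # Jensen-Shannon divergence between category frequency distributions.
    # scipy's jensenshannon() returns the JS *distance*; square it for the divergence.
    p = np.bincount(reference, minlength=10) / len(reference)
    q = np.bincount(model, minlength=10) / len(model)
    jsd = jensenshannon(p, q, base=2) ** 2  # base 2 is an assumption here

    # Stuart-Maxwell test of marginal homogeneity on the 10x10 paired table.
    table = np.zeros((10, 10))
    for r, m in zip(reference, model):
        table[r, m] += 1
    sm = SquareTable(table, shift_zeros=True).homogeneity(method="stuart_maxwell")

    # Cohen's kappa for case-level category agreement.
    kappa = cohen_kappa_score(reference, model)

    # Collapse to binary surgical vs non-surgical triage
    # (assuming, for illustration, that codes 0-4 denote surgical categories).
    ref_surg = reference < 5
    mod_surg = model < 5
    binary_kappa = cohen_kappa_score(ref_surg, mod_surg)

    # McNemar's test for bias in surgical vs non-surgical call rates.
    a = np.sum(ref_surg & mod_surg)
    b = np.sum(ref_surg & ~mod_surg)
    c = np.sum(~ref_surg & mod_surg)
    d = np.sum(~ref_surg & ~mod_surg)
    mc = mcnemar([[a, b], [c, d]], exact=True)

    print(f"JSD={jsd:.3f}  kappa={kappa:.3f}  binary kappa={binary_kappa:.3f}")
    print(f"Stuart-Maxwell p={sm.pvalue:.3f}  McNemar p={mc.pvalue:.3f}")

    In this framing, the JSD compares the overall category mix, the Stuart-Maxwell test checks whether the paired category distributions differ systematically, Cohen's κ measures chance-corrected case-level agreement, and McNemar's test checks for asymmetry in discordant surgical vs non-surgical calls, mirroring the analyses reported in the abstract.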

publication date

  • December 22, 2025

Identity

Digital Object Identifier (DOI)

  • 10.1177/21925682251411225

PubMed ID

  • 41424195