Medical Imaging Applications Developed Using Artificial Intelligence Demonstrate High Internal Validity Yet Are Limited in Scope and Lack External Validation.
Review
Overview
abstract
PURPOSE: To (1) review definitions and concepts necessary to interpret applications of deep learning (DL; a domain of artificial intelligence [AI] that leverages neural networks to make predictions on media inputs such as images); and (2) identify knowledge and translational gaps in the literature to provide insight into specific areas for improvement as adoption of this technology continues.
METHODS: A comprehensive search of the literature was performed in December 2023 for articles regarding the use of DL in sports medicine. For each study, the following were recorded: the joint of focus, the specific anatomic structure or pathology to which DL was applied, the imaging modality, the source of the images used for model training and testing, dataset size, model performance, and whether the DL model was externally validated. A numerical scale was used to rate each DL model's clinical impact, with 1 corresponding to proof-of-concept studies with little-to-no direct clinical impact and 5 corresponding to practice-changing clinical impact and readiness for clinical deployment.
RESULTS: Fifty-five studies were identified, all of which were published within the past five years and 82% of which were published within the past three years. Of the DL models identified, 84% were developed for classification tasks, 9% for automated measurements, and 7% for segmentation. A total of 62% of studies used MRI as the imaging modality, 25% used radiographs, and 7% used ultrasound, while one study each used CT, arthroscopic images, and arthroscopic video. Sixty-five percent of studies focused on the detection of tears (anterior cruciate ligament [ACL], rotator cuff [RC], and meniscus). Diagnostic performance, as determined by the area under the receiver operating characteristic curve (AUROC), ranged from 0.81 to 0.99 for ACL tears (excellent to near-perfect), from 0.83 to 0.94 for RC tears (excellent), and from 0.75 to 0.96 for meniscus tears (acceptable to excellent). In addition, three studies focused on the detection of cartilage lesions reported AUROCs ranging from 0.90 to 0.92 (excellent performance). However, only four (7%) studies externally validated their models, suggesting that the remaining models may not generalize to, or may not perform well in, populations other than those used to develop them. Finally, the mean clinical impact score was 2 (range, 1-3) on a scale of 1 to 5, corresponding to limited clinical applicability.
CONCLUSION: DL models in orthopaedic sports medicine show generally excellent performance (high internal validity) but require external validation to facilitate clinical deployment. In addition, current models have low clinical applicability and fail to advance the field because of a focus on routine tasks and a narrow conceptual framework.
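
As an illustration of the AUROC metric and of why external validation matters, the following is a minimal Python sketch and is not taken from any of the reviewed studies. It computes AUROC as the probability that a randomly chosen positive case receives a higher model score than a randomly chosen negative case (ties count as one half). The function name `auroc` and both small datasets are hypothetical and were invented purely for illustration.

```python
def auroc(labels, scores):
    """AUROC via pairwise comparison of positive and negative cases.

    labels: iterable of 0/1 ground-truth labels (1 = tear present).
    scores: iterable of model output scores, same length as labels.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive case ranked above negative case
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical internal test set: same institution as the training data.
internal_labels = [1, 1, 1, 0, 0, 0]
internal_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]

# Hypothetical external test set: a different institution and population.
external_labels = [1, 1, 1, 0, 0, 0]
external_scores = [0.9, 0.4, 0.6, 0.5, 0.3, 0.7]

internal_auroc = auroc(internal_labels, internal_scores)
external_auroc = auroc(external_labels, external_scores)
print(f"internal AUROC: {internal_auroc:.2f}")
print(f"external AUROC: {external_auroc:.2f}")
```

In this invented example the model separates the internal cases perfectly but ranks only six of nine positive-negative pairs correctly on the external cases. A gap of this kind cannot be detected without testing on data from outside the development population.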