Framework for Radiation Oncology Department-wide Evaluation and Implementation of Commercial Artificial Intelligence Autocontouring.

Overview

Abstract

PURPOSE: Artificial intelligence (AI)-based autocontouring in radiation oncology has potential benefits such as standardization and time savings. However, commercial AI solutions require careful evaluation before clinical integration. We developed a multidimensional evaluation method to test pretrained AI-based automated contouring solutions across a network of clinics.
METHODS AND MATERIALS: The curated dataset comprised 121 patient planning computed tomography (CT) scans from 4 clinics, with a total of 859 clinically approved contours that had been used for treatment. Regions of interest (ROIs) were generated with 3 commercial AI-based automated contouring software solutions (AI1, AI2, AI3) spanning the following disease sites: brain, head and neck (H&N), thorax, abdomen, and pelvis. Quantitative agreement between AI-generated and clinical contours was measured by Dice similarity coefficient (DSC) and Hausdorff distance (HD). Qualitative assessment was performed by multiple experts, who scored blinded AI contours on a Likert scale. A workflow and usability survey was also conducted.
RESULTS: AI1, AI2, and AI3 contours had high quantitative agreement (DSC >0.9) in 27.8%, 32.8%, and 34.1% of cases, respectively, performing well in pelvis (median DSC = 0.86/0.88/0.91) and thorax (median DSC = 0.91/0.89/0.91). The 3 solutions had low quantitative agreement (DSC <0.5) in 7.4%, 8.8%, and 6.1% of cases, respectively, performing worse in brain (median DSC = 0.65/0.78/0.75) and H&N (median DSC = 0.76/0.80/0.81). Qualitatively, AI1 and AI2 contours were rated acceptable with at most minor edits (rated 1-2) in 70.7% and 74.6% of ROIs, respectively (2906 ratings); rates were higher for abdomen (AI1: 79.2%) and thorax (AI2: 90.2%) and lower for H&N (29.0%/35.6%). An end-user survey showed a strong user preference for full automation and mixed preferences for accuracy versus total number of structures generated.
CONCLUSIONS: Our evaluation method provided a comprehensive analysis of both quantitative and qualitative measures of commercially available pretrained AI autocontouring algorithms. The evaluation framework served as a roadmap for clinical integration that aligned with user workflow preference.
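
The two quantitative metrics named in the methods can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each contour has already been rasterized to a set of voxel coordinates, it uses the maximum (not a percentile) Hausdorff distance in voxel units, and the function names `dice`, `hausdorff`, and `agreement_band` are hypothetical. Only the DSC >0.9 and DSC <0.5 cutoffs come from the abstract.

```python
from math import dist


def dice(a, b):
    """Dice similarity coefficient: 2|A ∩ B| / (|A| + |B|) for two voxel sets."""
    if not a and not b:
        return 1.0  # assumed convention: two empty contours agree perfectly
    return 2 * len(a & b) / (len(a) + len(b))


def hausdorff(a, b):
    """Symmetric (maximum) Hausdorff distance between two non-empty voxel sets."""
    if not a or not b:
        raise ValueError("Hausdorff distance is undefined for an empty contour")

    def directed(src, dst):
        # Largest distance from any point in src to its nearest point in dst.
        return max(min(dist(p, q) for q in dst) for p in src)

    return max(directed(a, b), directed(b, a))


def agreement_band(dsc):
    """Bin a DSC value using the cutoffs reported in the abstract."""
    if dsc > 0.9:
        return "high"
    if dsc < 0.5:
        return "low"
    return "intermediate"  # label assumed here; the abstract names only the two extremes


def square(row0, col0, size):
    """Toy 2D 'contour': a filled square of pixels (illustrative data only)."""
    return {(r, c) for r in range(row0, row0 + size)
            for c in range(col0, col0 + size)}


clinical = square(0, 0, 10)  # stand-in for a clinically approved contour
ai = square(0, 1, 10)        # stand-in for an AI contour shifted by one pixel

dsc = dice(clinical, ai)
hd = hausdorff(clinical, ai)
print(f"DSC = {dsc:.2f}, HD = {hd:.1f} voxels, agreement = {agreement_band(dsc)}")
```

The brute-force distance search is quadratic in the number of voxels, so it suits only small toy inputs; a clinical pipeline would typically work on contour surfaces, convert distances to millimeters using the CT voxel spacing, and rely on an optimized library routine.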