Exploring Real-Time Tracking of Vocal Fold Polyps in Video-Stroboscopy Using Deep Learning.
Abstract
OBJECTIVE: To develop and evaluate a deep learning object detection system for identifying vocal fold polyps in stroboscopic video frames using You Only Look Once (YOLO), and to assess the added benefit of temporal tracking on detection performance. METHODS: A retrospective dataset of 12,742 frames from 55 laryngoscopy video recordings was annotated with bounding boxes identifying vocal fold polyps. Pretrained YOLO11 and YOLO12 models were fine-tuned to detect polyps in these frames. A temporal tracking algorithm was then developed to propagate missed detections across adjacent frames. RESULTS: YOLO12 outperformed YOLO11 across all metrics. On the hold-out test set, YOLO12 reached a precision of 83.1% and an F1 score of 67.6%, with a mean average precision at an intersection-over-union threshold of 0.5 (mAP@0.5) of 64.1%. By comparison, YOLO11 achieved a precision of 67.3%, an F1 score of 56.2%, and a mAP@0.5 of 56.0%. Incorporating temporal tracking increased the mAP@0.5 of YOLO12 to 70.4% while maintaining a detection speed of 21.4 frames per second (fps), approaching real time (30 fps). CONCLUSIONS: YOLO12-based detection of vocal fold polyps in stroboscopy, enhanced with temporal tracking, achieved a mAP@0.5 of 70.4% at near-real-time speed. These results demonstrate the potential of real-time AI-assisted detection of vocal fold lesions.
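The abstract does not specify how the temporal tracking algorithm propagates missed detections. A minimal sketch of one plausible approach, assuming per-frame detections are represented as `(x1, y1, x2, y2)` bounding boxes (or `None` when the detector misses), is to copy the nearest detection from within a small window of adjacent frames; the function name `propagate_detections` and the `max_gap` parameter are hypothetical, not from the paper:

```python
def propagate_detections(dets, max_gap=2):
    """Fill detection gaps by borrowing a box from the nearest
    adjacent frame within max_gap frames (nearest neighbor wins).

    dets: list of (x1, y1, x2, y2) tuples or None per video frame.
    Returns a new list; original detections are never overwritten,
    and only originally detected boxes are propagated (no chaining).
    """
    filled = list(dets)
    for i, d in enumerate(dets):
        if d is not None:
            continue
        # Search outward: offset 1 before/after, then 2, up to max_gap.
        for off in range(1, max_gap + 1):
            for j in (i - off, i + off):
                if 0 <= j < len(dets) and dets[j] is not None:
                    filled[i] = dets[j]
                    break
            if filled[i] is not None:
                break
    return filled
```

In practice such propagation would typically be combined with an IoU check against neighboring boxes to avoid carrying a stale box across a scene change; this sketch shows only the gap-filling logic that could lift recall (and hence mAP@0.5) when the detector drops a polyp for a few frames.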