AI-Powered Laryngoscopy: Exploring the Future With Google Gemini
Academic Article
Abstract
Foundation models (FMs) are general-purpose artificial intelligence (AI) neural networks trained on massive datasets, including code, text, audio, images, and video, to handle a wide range of tasks, from generating text to analyzing images or composing music. We evaluated Google Gemini 1.5 Pro, currently the multimodal FM with the largest token context window and the best-performing commercial model for video analysis, on its interpretation of laryngoscopy frames and videos sourced from Google Images and YouTube. Gemini recognized the procedure as laryngoscopy in 87/88 frames (98.9%) and 15/15 video-laryngoscopies (100%), accurately diagnosed a pathology in 55/88 frames (62.5%) and 3/15 videos (20.0%), identified the lesion side in 58/88 frames (65.9%) and 6/15 videos (40.0%), and narrated two operative video-laryngoscopies without fine-tuning. These findings suggest that Gemini 1.5 Pro has significant potential for analyzing laryngoscopy, indicating that FMs could serve as clinical decision support tools for complex expert tasks in otolaryngology. LEVEL OF EVIDENCE: 3.