Multimodal multi-instance learning for cardiopulmonary exercise testing performance prediction.

Overview

Heart failure (HF) is a progressive and fatal disease that affects nearly 7 million individuals in the United States, with prevalence expected to surpass 10 million by 2040. Cardiopulmonary exercise testing (CPET) is the gold standard for assessing functional capacity and predicting survival in HF patients, but its widespread use is limited by practical constraints. Here we introduce a multimodal multi-instance learning framework that predicts peak oxygen consumption (peak VO₂), a critical CPET indicator, from more accessible transthoracic echocardiography (TTE) studies and electronic health records (EHR). By modeling both the cross-modal interactions and the multi-instance structure of TTE studies, our approach significantly improves predictive accuracy and generalization. The model achieves an R² of 0.603 in peak VO₂ prediction and an AUROC of 0.849 in high-risk patient identification, surpassing prior work (R² = 0.529, AUROC = 0.836). On the external validation cohort, the model achieves an R² of 0.541 and an AUROC of 0.870, compared with 0.395 and 0.797, respectively, for prior work. This improved performance allows more accurate identification of patients who may benefit from advanced heart failure therapies and who might otherwise have been missed.
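To make the multi-instance idea concrete, the following is a minimal sketch, not the architecture used in this work. It assumes each TTE study is a "bag" of per-view embedding vectors, pools them with a simple attention mechanism, and fuses the pooled vector with an EHR feature vector by plain concatenation before a linear regression head. Concatenation is only a stand-in for the cross-modal interaction modeling described above, and every name, dimension, and weight below (`attention_pool`, `predict_peak_vo2`, `attn_w`, `head_w`, `head_b`) is hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(
    instances: Sequence[Sequence[float]], attn_w: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Collapse a bag of instance embeddings into one vector.

    Each instance gets a scalar score w . tanh(h); the scores are
    softmax-normalized and used as weights in a weighted average.
    """
    scores = [
        sum(w * math.tanh(x) for w, x in zip(attn_w, inst)) for inst in instances
    ]
    weights = softmax(scores)
    dim = len(instances[0])
    pooled = [
        sum(a * inst[d] for a, inst in zip(weights, instances)) for d in range(dim)
    ]
    return pooled, weights


def predict_peak_vo2(
    tte_instances: Sequence[Sequence[float]],
    ehr_features: Sequence[float],
    attn_w: Sequence[float],
    head_w: Sequence[float],
    head_b: float,
) -> Tuple[float, List[float]]:
    """Pool TTE instances, concatenate with EHR features, apply a linear head."""
    pooled, weights = attention_pool(tte_instances, attn_w)
    fused = pooled + list(ehr_features)  # assumption: fusion by concatenation
    y = sum(w * x for w, x in zip(head_w, fused)) + head_b
    return y, weights


# Toy usage with made-up numbers: three 2-d TTE view embeddings, two EHR features.
tte = [[0.2, -0.1], [0.9, 0.4], [0.1, 0.0]]
ehr = [0.5, 1.0]
pred, attn = predict_peak_vo2(
    tte, ehr, attn_w=[1.0, 1.0], head_w=[1.0, 1.0, 0.5, 0.5], head_b=10.0
)
print(pred, attn)
```

The attention weights sum to one across the instances of a study, so they can be read as a rough indication of which views contributed most to the prediction. In a trained model the weights would be learned rather than fixed by hand as they are here.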