Predicting self-reported injury status among runners training for the New York City Marathon.
Academic Article
Abstract
BACKGROUND: Although numerous studies have examined risk factors and prevention strategies for running-related injuries, few have rigorously tested predictive models.
OBJECTIVE: To describe injury patterns throughout marathon training and assess the feasibility of machine learning (ML) models in predicting upcoming weekly running injury status during marathon training with activity logs and runner-reported survey data.
DESIGN: In this prospective observational study, participants completed baseline surveys, 16 weekly interval surveys during marathon training, and shared GPS watch and smartphone-based running logs from Strava. Injury status was summarized and used to train two tabular ML models incorporating baseline, prior surveys, and aggregated training logs.
SETTING: All data were collected remotely via online platforms.
PARTICIPANTS: A total of 643 adult runners (53% female, mean age 43 years) training for the 2022 Tata Consultancy Services New York City Marathon were recruited and had linkable Strava running logs.
INTERVENTIONS: N/A.
MAIN OUTCOME MEASURE: Self-reported weekly injury status indicating modification of training.
RESULTS: Of 643 runners, 307 (48%) experienced at least one injury requiring modification of training during the analysis window. Across 9002 runner-week observations, 75% indicated no injury. Injury status tended to persist, with most runners maintaining the same status week-to-week, and recovery was more common than worsening. For the first model (all runner-week observations), predictive modeling with generalized additive models yielded good performance, specifically an area under the receiver operating characteristic curve (AUROC) = 87% and area under the precision-recall curve (AUPRC) = 52%. For the second model (previously uninjured runner-week observations), performance was poor, yielding AUROC = 67% and AUPRC = 8%. Top features' partial dependency plots often showed nonlinear relationships with injury risk.
CONCLUSIONS: ML using survey and running activity data had low discriminatory power in predicting weekly running injury among runners not already modifying their training, though results highlight the importance of the prior week's injury status. Larger samples, more precise injury timing, and additional predictors are likely needed to improve performance and inform future injury prevention strategies.
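
The two modeling cohorts described in the abstract (all runner-week observations, and only those where the runner was previously uninjured) can be illustrated with a minimal sketch. This is not the authors' pipeline: the record layout, the names `build_runner_weeks` and `persistence_accuracy`, and the toy survey data are all hypothetical, and the sketch assumes that "previously uninjured" means no injury reported in the immediately preceding week. It only shows how a prior-week status feature might be derived from weekly surveys, and why a "same as last week" guess is a natural baseline when injury status tends to persist.

```python
from collections import defaultdict

# Hypothetical toy survey records: (runner_id, week, injured), where
# injured=True means the runner reported modifying training that week.
surveys = [
    ("r1", 1, False), ("r1", 2, False), ("r1", 3, True), ("r1", 4, True),
    ("r2", 1, True), ("r2", 2, False), ("r2", 3, False),
]

def build_runner_weeks(records):
    """Pair each week's injury status (the target) with the prior week's status."""
    by_runner = defaultdict(dict)
    for runner_id, week, injured in records:
        by_runner[runner_id][week] = injured
    rows = []
    for runner_id, weeks in by_runner.items():
        for week in sorted(weeks):
            if week - 1 not in weeks:
                continue  # no prior-week survey, so no lagged feature
            rows.append({
                "runner_id": runner_id,
                "week": week,
                "prior_injured": weeks[week - 1],
                "injured": weeks[week],
            })
    return rows

def persistence_accuracy(rows):
    """Accuracy of the naive guess 'this week's status equals last week's'."""
    hits = sum(r["prior_injured"] == r["injured"] for r in rows)
    return hits / len(rows)

# Cohort 1: every runner-week that has a prior-week survey.
all_rows = build_runner_weeks(surveys)
# Cohort 2 (assumed definition): runner-weeks with no injury in the prior week.
uninjured_rows = [r for r in all_rows if not r["prior_injured"]]

print(len(all_rows), len(uninjured_rows))
print(persistence_accuracy(all_rows))
```

In the second cohort the prior-week status is constant, so it carries no signal and any predictive power has to come from other features, which is consistent with the weaker performance reported for the second model.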