Evaluating Methods for Imputing Race and Ethnicity in Electronic Health Record Data. Academic Article

Overview

abstract

  • OBJECTIVE: To compare anonymized and non-anonymized approaches for imputing race and ethnicity in descriptive studies of chronic disease burden using electronic health record (EHR)-based datasets.
    STUDY SETTING AND DESIGN: In this New York City-based study, we first conducted simulation analyses under different missing data mechanisms to assess the performance of Bayesian Improved Surname Geocoding (BISG), single imputation using neighborhood majority information, random forest imputation, and multiple imputation with chained equations (MICE). Imputation performance was measured using sensitivity, precision, and overall accuracy; agreement with self-reported race and ethnicity was measured with Cohen's kappa (κ). We then applied these methods to impute race and ethnicity in two EHR-based data sources and compared chronic disease burden (95% CIs) by race and ethnicity across imputation approaches.
    DATA SOURCES AND ANALYTIC SAMPLE: Our data sources included EHR data from NYU Langone Health and the INSIGHT Clinical Research Network from 3/6/2016 to 3/7/2020, extracted for a parent study on older adults in NYC with multiple chronic conditions.
    PRINCIPAL FINDINGS: Under simulation analyses, the non-anonymized BISG imputation provided the most accurate classification of race and ethnicity, with accuracy ranging from 66% to 73% across missing data mechanisms. Anonymized imputation methods were more sensitive to the missing data mechanism, with agreement dropping when race and ethnicity was missing not at random (MNAR) (κ = 0.25 for single imputation, 0.25 for MICE, and 0.33 for random forest). When these methods were applied to the NYU and INSIGHT cohorts, however, racial and ethnic distributions and chronic disease burden were consistent across all imputation methods. Slight improvements in the precision of estimates were observed under all imputation approaches compared to a complete case analysis.
    CONCLUSIONS: BISG imputation may provide a more accurate racial and ethnic classification than single or multiple imputation using anonymized covariates, particularly if the missing data mechanism is MNAR. Descriptive studies of disease burden may not be sensitive to methods for imputing missing data.
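
The evaluation metrics named in the abstract (overall accuracy, per-category sensitivity and precision, and Cohen's kappa) can be illustrated with a minimal Python sketch. This is not the authors' code: the function name `agreement_metrics`, the category labels "A", "B", "C", and the toy label lists are all hypothetical and chosen only to show how agreement between self-reported and imputed categories might be computed.

```python
from collections import Counter


def agreement_metrics(self_reported, imputed):
    """Compare imputed labels against self-reported labels.

    Hypothetical helper for illustration; returns overall accuracy,
    Cohen's kappa, and per-category sensitivity and precision.
    """
    if not self_reported or len(self_reported) != len(imputed):
        raise ValueError("label lists must be non-empty and of equal length")

    n = len(self_reported)
    pairs = Counter(zip(self_reported, imputed))   # (true, imputed) -> count
    true_counts = Counter(self_reported)
    pred_counts = Counter(imputed)
    categories = sorted(set(true_counts) | set(pred_counts))

    # Observed agreement is the same quantity as overall accuracy.
    observed = sum(pairs[(c, c)] for c in categories) / n
    # Agreement expected by chance, from the two marginal distributions.
    expected = sum(true_counts[c] * pred_counts[c] for c in categories) / n**2
    kappa = (observed - expected) / (1 - expected) if expected < 1 else float("nan")

    per_category = {}
    for c in categories:
        hits = pairs[(c, c)]
        per_category[c] = {
            # Share of people truly in category c who were imputed as c.
            "sensitivity": hits / true_counts[c] if true_counts[c] else float("nan"),
            # Share of people imputed as c who truly are in category c.
            "precision": hits / pred_counts[c] if pred_counts[c] else float("nan"),
        }

    return {"accuracy": observed, "kappa": kappa, "per_category": per_category}


# Toy data only (assumption): placeholder categories, not study data.
self_reported = ["A", "A", "A", "A", "B", "B", "B", "C", "C", "C"]
imputed = ["A", "A", "A", "B", "B", "B", "C", "C", "C", "A"]

result = agreement_metrics(self_reported, imputed)
print(result["accuracy"], round(result["kappa"], 3))
```

Kappa discounts the agreement that would be expected by chance from the marginal distributions alone, so it can be much lower than raw accuracy when one category dominates.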

publication date

  • May 27, 2025

Identity

Digital Object Identifier (DOI)

  • 10.1111/1475-6773.14649

PubMed ID

  • 40421571