Evaluating Methods for Imputing Race and Ethnicity in Electronic Health Record Data.

Overview

abstract

OBJECTIVE: To compare anonymized and non-anonymized approaches for imputing race and ethnicity in descriptive studies of chronic disease burden using electronic health record (EHR)-based datasets.

STUDY SETTING AND DESIGN: In this New York City-based study, we first conducted simulation analyses under different missing data mechanisms to assess the performance of Bayesian Improved Surname Geocoding (BISG), single imputation using neighborhood majority information, random forest imputation, and multiple imputation with chained equations (MICE). Imputation performance was measured using sensitivity, precision, and overall accuracy; agreement with self-reported race and ethnicity was measured with Cohen's kappa (κ). We then applied these methods to impute race and ethnicity in two EHR-based data sources and compared chronic disease burden (95% CIs) by race and ethnicity across imputation approaches.

DATA SOURCES AND ANALYTIC SAMPLE: Our data sources included EHR data from NYU Langone Health and the INSIGHT Clinical Research Network from 3/6/2016 to 3/7/2020 extracted for a parent study on older adults in NYC with multiple chronic conditions.

PRINCIPAL FINDINGS: Under simulation analyses, the non-anonymized BISG imputation provided the most accurate classification of race and ethnicity, ranging from 66% to 73% across missing data mechanisms. Anonymized imputation methods were more sensitive to the missing data mechanism, with agreement dropping when race and ethnicity was missing not at random (MNAR) (κ_single = 0.25, κ_MICE = 0.25, κ_randomforest = 0.33). When these methods were applied to the NYU and INSIGHT cohorts, however, racial and ethnic distributions and chronic disease burden were consistent across all imputation methods. Slight improvements in the precision of estimates were observed under all imputation approaches compared to a complete case analysis.

CONCLUSIONS: BISG imputation may provide a more accurate racial and ethnic classification than single or multiple imputation using anonymized covariates, particularly if the missing data mechanism is MNAR. Descriptive studies of disease burden may not be sensitive to methods for imputing missing data.
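
The performance measures named in the abstract (per-group sensitivity and precision, overall accuracy, and Cohen's kappa) can be illustrated with a minimal Python sketch. This is not the authors' code: the function name `agreement_metrics` and the toy labels "A", "B", "C" are hypothetical, chosen only to show how imputed categories might be compared against self-reported ones.

```python
from collections import Counter


def agreement_metrics(truth, imputed):
    """Compare imputed labels with self-reported labels (hypothetical helper).

    Returns overall accuracy, Cohen's kappa, and per-label sensitivity
    and precision. A value of None marks an undefined ratio.
    """
    if len(truth) != len(imputed) or not truth:
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(truth)
    labels = sorted(set(truth) | set(imputed))
    truth_counts = Counter(truth)
    imputed_counts = Counter(imputed)
    # Count exact matches per label (the diagonal of the confusion matrix).
    hits = Counter(t for t, p in zip(truth, imputed) if t == p)

    accuracy = sum(hits.values()) / n

    per_label = {}
    for label in labels:
        # Sensitivity: share of people truly in the group who are imputed into it.
        sens = hits[label] / truth_counts[label] if truth_counts[label] else None
        # Precision: share of people imputed into the group who truly belong to it.
        prec = hits[label] / imputed_counts[label] if imputed_counts[label] else None
        per_label[label] = {"sensitivity": sens, "precision": prec}

    # Agreement expected by chance, from the two marginal distributions.
    expected = sum(truth_counts[l] * imputed_counts[l] for l in labels) / n**2
    kappa = (accuracy - expected) / (1 - expected) if expected < 1 else None

    return {"accuracy": accuracy, "kappa": kappa, "per_label": per_label}


# Toy data, invented for illustration only.
self_reported = ["A", "A", "B", "B", "C", "C"]
imputed_labels = ["A", "B", "B", "B", "C", "A"]
result = agreement_metrics(self_reported, imputed_labels)
print(result["accuracy"], result["kappa"])
```

Kappa discounts the agreement that would arise by chance from the marginal distributions alone, which is why it can fall well below raw accuracy when one group dominates the sample.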

authors

Weiner, Mark
Adhikari, Samrachana

publication date

May 27, 2025

published in

Health Services Research (Journal)

Research

keywords

Electronic Health Records
Ethnicity
Racial Groups

Identity

PubMed Central ID

PMC12461102

Scopus Document Identifier

105006746317

Digital Object Identifier (DOI)

10.1111/1475-6773.14649

PubMed ID

40421571

Additional Document Info

has global citation frequency

1

volume

60

issue

5

VIVO Weill Cornell Medical College

Evaluating Methods for Imputing Race and Ethnicity in Electronic Health Record Data. Academic Article
