Design and creation of a racially diverse lung cancer registry with detailed genomic and environmental annotation.
Academic Article
Overview
Abstract
BACKGROUND: The proportion of lung cancers occurring in individuals who have never smoked is growing, and these cancers are prone to harboring mutations in the epidermal growth factor receptor (EGFR) gene. Little is known about risk factors and prognostic indicators for EGFR-mutant cancers, and current research is limited by the scarcity of datasets that integrate genomic, clinical, and environmental data.

METHODS: We created the Meyer Cancer Center-Molecularly Enhanced Lung cancer Database (MCC-MELD), which includes lung cancer cases from a large catchment area in New York City. We identified cases through linkage to our institution's cancer registry and to a clinician-initiated, manually curated database. We linked all cases to the electronic health record and to in-house tumor genomic testing results. We used natural language processing (NLP) to extract unstructured genomic testing results and detailed smoking history. We linked geocoded addresses to detailed area-level measures.

RESULTS: MCC-MELD contains 9,573 lung cancer patients diagnosed between 1988 and 2024, of whom 20% were non-Hispanic Asian, 14% were non-Hispanic Black, and 8% were Hispanic. We identified 1,092 (11.4%) EGFR-mutant cancers, including 397 cases that NLP identified but structured data did not. NLP showed high accuracy in ascertaining EGFR status (97%) and quantitative smoking history variables (90-98%). Never-smokers made up 16% of the cases in MCC-MELD.

CONCLUSIONS: MCC-MELD is an NLP-enhanced database containing clinical information, genomic testing results, and linkages to area-level data for lung cancer patients from a diverse urban setting.

IMPACT: This resource can facilitate studies on lung cancer risk factors, treatment patterns, and outcomes by EGFR and other driver mutation status.
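
The abstract does not describe how the NLP step was implemented. As a rough illustration only, the idea of supplementing structured genomic results with status extracted from free-text reports could be sketched as below. The regular expressions, the function names (egfr_status_from_text, combine_status), and the example notes are all hypothetical and are not the MCC-MELD pipeline; a production system would need a much richer lexicon, negation handling, and validation against manual chart review.

```python
import re

# Hypothetical keyword patterns (assumptions for illustration, not the authors' rules).
EGFR_MENTION = re.compile(r"\bEGFR\b", re.IGNORECASE)
POSITIVE = re.compile(
    r"\b(exon\s*19\s*del\w*|L858R|T790M|mutation\s+detected|positive)\b",
    re.IGNORECASE,
)
NEGATIVE = re.compile(
    r"\b(no\s+mutations?|not\s+detected|negative|wild[\s-]?type)\b",
    re.IGNORECASE,
)


def egfr_status_from_text(note):
    """Return 'mutant', 'wild-type', or None for one free-text report."""
    # Split into rough sentences so a finding for another gene is not misattributed.
    for sentence in re.split(r"(?<=[.;])\s+", note):
        if not EGFR_MENTION.search(sentence):
            continue
        # Check negative phrasing first so "mutation not detected" is not read as positive.
        if NEGATIVE.search(sentence):
            return "wild-type"
        if POSITIVE.search(sentence):
            return "mutant"
    return None


def combine_status(structured, text_status):
    """Prefer the structured result; fall back to the text-derived one."""
    return structured if structured is not None else text_status


# Toy records (invented for this sketch).
cases = [
    {"id": "A", "structured": "mutant", "note": "EGFR exon 19 deletion detected."},
    {"id": "B", "structured": None,
     "note": "Adenocarcinoma. EGFR L858R mutation detected by sequencing."},
    {"id": "C", "structured": None,
     "note": "EGFR: no mutation detected. KRAS G12C positive."},
    {"id": "D", "structured": None, "note": "Molecular testing not performed."},
]

final = {
    c["id"]: combine_status(c["structured"], egfr_status_from_text(c["note"]))
    for c in cases
}

# Cases found to be EGFR-mutant only through the text-derived status.
nlp_only = [
    c["id"] for c in cases
    if c["structured"] is None and final[c["id"]] == "mutant"
]
```

In this toy setup, nlp_only plays the role of the cases that structured data alone would have missed; the accuracy of any such extractor would have to be measured against manually reviewed records, as the abstract reports doing.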