Automated Information Extraction from Unstructured Hematopathology Reports to Support Response Assessment in Myeloproliferative Neoplasms.
Academic Article
Overview
Abstract
BACKGROUND: Assessing treatment response in patients with myeloproliferative neoplasms is difficult because the relevant data elements reside in unstructured bone marrow pathology (hematopathology) reports, which require specialized manual annotation and interpretation. Although natural language processing (NLP) has been successfully implemented for the extraction of features from solid tumor reports, little is known about its application to hematopathology.

METHODS: An open-source NLP framework called Leo was implemented to parse document segments and extract the concept phrases used for assessing responses in myeloproliferative neoplasms. A reference standard was generated through the manual review of hematopathology notes.

RESULTS: Compared with a reference standard (n = 300 reports), our NLP method performed well on features such as aspirate myeloblasts (F1: 0.98) and biopsy reticulin fibrosis (F1: 0.93). However, performance on other values, such as myeloblasts from the biopsy (F1: 0.06) and via flow cytometry (F1: 0.08), was limited by sparsity, which reflects reporting conventions. The four features with the highest clinical importance were extracted with F1 scores exceeding 0.90. Whereas manual annotation of 300 reports required 30 hours of staff effort, automated NLP required 3.5 hours of runtime for 34,301 reports.

CONCLUSIONS: To the best of our knowledge, this is among the first studies to demonstrate the application of NLP to hematopathology for the purpose of clinical feature extraction. The approach may inform efforts at other institutions, and the code is available at https://github.com/wcmc-research-informatics/BmrExtractor.
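The F1 scores and the effort figures reported in the results can be made concrete with a little arithmetic. The following Python sketch is illustrative only and is not taken from the BmrExtractor code: the function names are hypothetical, and the true-positive, false-positive, and false-negative counts are invented to show how a sparsely reported feature can yield a very low F1. Only the hours and report counts come from the abstract.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 as the harmonic mean of precision and recall.

    Equivalent closed form: 2*TP / (2*TP + FP + FN).
    Returns 0.0 when there are no positives at all.
    """
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def seconds_per_report(hours: float, n_reports: int) -> float:
    """Average wall-clock seconds spent per report."""
    return hours * 3600 / n_reports


# Hypothetical counts (NOT from the study): a frequently reported feature
# versus one that appears in only a handful of reports.
dense_feature_f1 = f1_score(tp=95, fp=2, fn=3)
sparse_feature_f1 = f1_score(tp=1, fp=5, fn=20)

# Figures stated in the abstract.
manual = seconds_per_report(hours=30, n_reports=300)
automated = seconds_per_report(hours=3.5, n_reports=34301)
ratio = manual / automated

print(f"dense F1:  {dense_feature_f1:.2f}")
print(f"sparse F1: {sparse_feature_f1:.2f}")
print(f"manual:    {manual:.0f} s/report")
print(f"automated: {automated:.2f} s/report")
print(f"ratio:     {ratio:.0f}x")
```

Note that the ratio compares staff hours with machine runtime, so it is a rough indication of scale rather than a like-for-like measurement.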