Assessing the feasibility and external validity of natural language processing-extracted data for advanced lung cancer patients
Yuchen Li, Jennifer Law, Lisa W Le, Janice J N Li, Christopher Pettengell, Patricia Demarco, Michael Duong, David Merritt, Sean Davidson, Mike Sung, Qixuan Li, Sally Cm Lau, Sajda Zahir, Ryan Chu, Malcom Ryan, Khizar Karim, Josh Morganstein, Adrian Sacher, Lawson Eng, Frances A Shepherd, Penelope Bradbury, Geoffrey Liu, Natasha B Leighl
Lung Cancer, volume 199, article 108080. Published 2025-01-04. DOI: 10.1016/j.lungcan.2025.108080 (https://doi.org/10.1016/j.lungcan.2025.108080)
Journal impact factor: 4.5 (JCR Q1, Oncology)
Citations: 0
Abstract
Background: Manual extraction of real-world clinical data for research can be time-consuming and prone to error. We assessed the feasibility of using natural language processing (NLP), an AI technique, to automate data extraction for patients with advanced lung cancer (aLC). We also assessed the external validity of the NLP-extracted data by comparing our findings to those reported in the literature.
Methods: Patients diagnosed with stage IIIB or IV lung cancer between January 2015 and December 2017 at Princess Margaret Cancer Centre who received at least one dose of systemic therapy were included. Their electronic health records were provided to Pentavere's NLP platform, DARWEN™, in March 2019. Descriptive statistics summarized baseline patient and cancer characteristics, molecular biomarkers, and first-line systemic therapies. Cox multivariate models were used to evaluate prognostic factors for the advanced non-small cell lung cancer (NSCLC) and small-cell lung cancer (SCLC) cohorts.
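As a reading aid for the Cox multivariate models named in the Methods, the standard proportional hazards form is sketched below. The notation (h_0, β, x, SE) is generic textbook notation and is not taken from the paper, and the 1.96 multiplier assumes a Wald-type 95% confidence interval, which the abstract does not state.

```latex
h(t \mid x) = h_0(t)\,\exp(\beta^\top x), \qquad
\mathrm{HR}_j = \exp(\beta_j), \qquad
95\%\ \mathrm{CI}_j = \exp\!\bigl(\hat\beta_j \pm 1.96\,\mathrm{SE}(\hat\beta_j)\bigr)
```

Under this form, a hazard ratio above 1 for a covariate indicates a higher hazard of death, holding the other covariates fixed. If age is modelled in years, a hazard ratio quoted per 10 years corresponds to \exp(10\,\beta_{\text{age}}).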
Results: NLP extracted clinical information for 333 patients in a total of 8 hours, with few missing values: smoking status (n = 2) and Eastern Cooperative Oncology Group (ECOG) performance status (n = 5). Baseline patient and cancer characteristics summarized from NLP-extracted data were comparable to those in previous studies and population reports. For NSCLC patients, being male (HR 1.44, 95% CI [1.04, 2.00]), having worse ECOG (1.48 [1.22, 1.81]), and having liver (2.24 [1.45, 3.46]), bone (2.09 [1.48, 2.96]), or lung metastases (2.54 [1.05, 2.26]) were associated with worse survival outcomes. For SCLC patients, older age (HR 1.70 per 10 years, 95% CI [1.10, 2.63]) and liver metastases (3.81 [1.61, 9.01]) were associated with worse survival outcomes.
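The hazard ratios above can be sanity-checked with a short script. The sketch below assumes the intervals are Wald-type and therefore symmetric on the log scale, so each point estimate should sit near the geometric mean of its bounds; the abstract does not state how the intervals were computed, and the function names and the 2% tolerance are illustrative choices, not part of the study. The values are copied as printed in the abstract.

```python
import math

# (label, hazard ratio, CI lower, CI upper), copied as printed in the abstract
REPORTED = [
    ("NSCLC male sex", 1.44, 1.04, 2.00),
    ("NSCLC worse ECOG", 1.48, 1.22, 1.81),
    ("NSCLC liver metastases", 2.24, 1.45, 3.46),
    ("NSCLC bone metastases", 2.09, 1.48, 2.96),
    ("NSCLC lung metastases", 2.54, 1.05, 2.26),
    ("SCLC age per 10 years", 1.70, 1.10, 2.63),
    ("SCLC liver metastases", 3.81, 1.61, 9.01),
]


def check_interval(hr, lower, upper, rel_tol=0.02):
    """Return (log-scale midpoint, consistent?) for one reported estimate.

    Assumption: the CI is symmetric around log(HR), so the midpoint on the
    original scale is the geometric mean of the two bounds.
    """
    midpoint = math.sqrt(lower * upper)
    inside = lower <= hr <= upper
    close = abs(hr - midpoint) / midpoint <= rel_tol
    return midpoint, inside and close


def flag_inconsistent(rows, rel_tol=0.02):
    """List the labels whose point estimate does not match its interval."""
    return [
        label
        for label, hr, lower, upper in rows
        if not check_interval(hr, lower, upper, rel_tol)[1]
    ]


if __name__ == "__main__":
    for label, hr, lower, upper in REPORTED:
        midpoint, ok = check_interval(hr, lower, upper)
        status = "ok" if ok else "CHECK"
        print(f"{label}: HR {hr:.2f}, log-scale midpoint {midpoint:.2f} [{status}]")
```

Under these assumptions, six of the seven estimates agree with their intervals, while the lung metastases estimate as printed lies outside its own interval. The script only flags the mismatch; it cannot tell which of the three printed numbers is in error, so the value should be checked against the full article.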
Conclusion: Our study demonstrated that automated data extraction using NLP is feasible and time efficient. Additionally, the NLP-extracted data can be used to identify valid and useful clinical endpoints for research. NLP holds significant potential to accelerate the extraction of real-world data for future observational studies.
About the journal:
Lung Cancer is an international publication covering the clinical, translational and basic science of malignancies of the lung and chest region. Original research articles, early reports, review articles, editorials and correspondence covering the prevention, epidemiology and etiology, basic biology, pathology, clinical assessment, surgery, chemotherapy, radiotherapy, combined treatment modalities, other treatment modalities and outcomes of lung cancer are welcome.