Transformer-based deep learning model for the diagnosis of suspected lung cancer in primary care based on electronic health record data
Lan Wang, Yonghua Yin, Ben Glampson, Robert Peach, Mauricio Barahona, Brendan C Delaney, Erik K Mayer
EBioMedicine 110:105442. Published 2024-11-12. Impact factor 9.7; JCR Q1 (Medicine, Research & Experimental).
DOI: https://doi.org/10.1016/j.ebiom.2024.105442
Funding: Cancer Research UK.
Citations: 0
Abstract
Background: Because it is usually diagnosed at a late stage, lung cancer is the most common cause of cancer death in the UK. Existing epidemiological risk models in clinical use, which have positive predictive values (PPV) of less than 10%, do not consider the temporal relations expressed in sequential electronic health record (EHR) data. We aimed to build a model for early detection of lung cancer in primary care, applying deep 'transformer' models to EHR data to learn from these complex sequential 'care pathways'.
Methods: We split the Whole Systems Integrated Care (WSIC) dataset into 70% training and 30% validation. Within the training set we created a case-control study with lung cancer cases and controls with 'other' cancers, respiratory conditions, or 'other' non-cancer conditions. Among 3,303,992 patients with records from January 1981 to December 2020, there were 11,847 lung cancer cases. We used 5,789 cases and 7,240 controls for training, and 50,000 patients randomly selected from the whole validation population of 368,906 for validation. GP EHR data covering the three years before the date of diagnosis, excluding the most recent month, were semantically pre-processed by mapping more than 30,000 terms to 450. The model was built using ALBERT with a Logistic Regression Classifier (LRC) head. Clustering was explored using k-means. A regression model alone was also built on the pre-processed data as a comparator.
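The pre-processing step described in the Methods (a three-year look-back that excludes the most recent month, followed by mapping raw terms to a reduced vocabulary) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the mapping table, the concept labels, the `UNK` fallback, the function name `build_sequence`, and the day-based approximations of "three years" and "one month" are all assumptions made for illustration.

```python
from datetime import date, timedelta

# Hypothetical mapping from raw EHR terms to a reduced concept vocabulary.
# The study maps more than 30,000 terms to 450; these entries are invented.
TERM_TO_CONCEPT = {
    "persistent cough": "COUGH",
    "chronic cough": "COUGH",
    "chest x-ray requested": "CHEST_XRAY",
    "cxr": "CHEST_XRAY",
}

LOOKBACK = timedelta(days=3 * 365)  # assumption: three years as 1,095 days
EXCLUSION = timedelta(days=30)      # assumption: one month as 30 days


def build_sequence(events, index_date):
    """Return the time-ordered concept sequence for one patient.

    events: iterable of (event_date, raw_term) pairs.
    index_date: date of diagnosis (for controls, some reference date
    would have to be chosen; how is not specified here).
    """
    start = index_date - LOOKBACK
    end = index_date - EXCLUSION
    # Keep only events inside [start, end), i.e. drop the final month.
    in_window = [(d, t) for d, t in events if start <= d < end]
    in_window.sort(key=lambda e: e[0])
    # Unmapped terms fall back to a placeholder token (an assumption).
    return [TERM_TO_CONCEPT.get(t.lower(), "UNK") for _, t in in_window]


example_events = [
    (date(2020, 5, 25), "CXR"),                    # inside the excluded month
    (date(2019, 1, 10), "Persistent cough"),       # inside the window
    (date(2016, 1, 1), "chronic cough"),           # older than three years
    (date(2020, 3, 2), "Chest X-ray requested"),   # inside the window
]
example_sequence = build_sequence(example_events, date(2020, 6, 1))
print(example_sequence)
```

A sequence of this kind is what a transformer such as ALBERT would consume as tokens; the tokenisation and the classifier head are outside the scope of this sketch.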
Findings: Our model achieved an AUROC of 0.924 (95% CI 0.921-0.927), a PPV of 3.6% (95% CI 3.5-3.7), and a sensitivity of 86.6% (95% CI 85.3-87.8), based on three years of data prior to diagnosis, excluding the month immediately before the index diagnosis. The comparator regression model achieved a PPV of 3.1% (95% CI 3.0-3.1) and an AUROC of 0.887 (95% CI 0.884-0.889). We interpreted our model using cluster analysis and identified six groups of patients exhibiting similar lung cancer progression and clinical investigation patterns.
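A high AUROC alongside a single-digit PPV is what Bayes' rule predicts when the condition is rare. The Python sketch below shows the standard relationship between sensitivity, specificity, prevalence, and PPV. The specificity value is hypothetical (the abstract does not report one), and using the overall case rate of 11,847 in 3,303,992 as the prevalence is an assumption, since the case rate in the validation sample is not stated.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value from Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Overall case rate implied by the counts in the Methods (an assumption
# when used as the prevalence in the validation sample).
overall_prevalence = 11_847 / 3_303_992

SENSITIVITY = 0.866          # reported in the Findings
ASSUMED_SPECIFICITY = 0.90   # hypothetical; not reported in the abstract

# With sensitivity and specificity held fixed, PPV rises with prevalence.
for prev in (overall_prevalence, 0.01, 0.05):
    value = ppv(SENSITIVITY, ASSUMED_SPECIFICITY, prev)
    print(f"prevalence={prev:.4f} -> PPV={value:.3f}")
```

The point of the sketch is the shape of the relationship, not the specific numbers: at a prevalence well below 1%, even a small false-positive rate yields many more false positives than true positives.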
Interpretation: Capturing temporal sequencing between cancer and non-cancer pathways to diagnosis enables much more accurate models. Future work will focus on external dataset validation and integration into GP clinical systems for evaluation.
EBioMedicine: Biochemistry, Genetics and Molecular Biology (General Biochemistry, Genetics and Molecular Biology)
CiteScore
17.70
Self-citation rate
0.90%
Articles published
579
Review time
5 weeks
About the journal:
eBioMedicine is a comprehensive biomedical research journal that covers a wide range of studies that are relevant to human health. Our focus is on original research that explores the fundamental factors influencing human health and disease, including the discovery of new therapeutic targets and treatments, the identification of biomarkers and diagnostic tools, and the investigation and modification of disease pathways and mechanisms. We welcome studies from any biomedical discipline that contribute to our understanding of disease and aim to improve human health.