Transformer-based deep learning model for the diagnosis of suspected lung cancer in primary care based on electronic health record data.

EBioMedicine, vol. 110, 105442; IF 9.7; Zone 1 (Medicine); JCR Q1 (Medicine, Research & Experimental); Pub Date: 2024-11-12; DOI: 10.1016/j.ebiom.2024.105442

Lan Wang, Yonghua Yin, Ben Glampson, Robert Peach, Mauricio Barahona, Brendan C Delaney, Erik K Mayer

Citations: 0

Abstract

Background: Due to its late stage of diagnosis, lung cancer is the commonest cause of death from cancer in the UK. Existing epidemiological risk models in clinical use, which have Positive Predictive Values (PPV) of less than 10%, do not consider the temporal relations expressed in sequential electronic health record (EHR) data. We aimed to build a model for the early detection of lung cancer in primary care, using machine learning with deep 'transformer' models on EHR data to learn from these complex sequential 'care pathways'.

Methods: We split the Whole Systems Integrated Care (WSIC) dataset into 70% training and 30% validation. Within the training set we created a case-control study with lung cancer cases and control cases of 'other' cancers, respiratory conditions, or 'other' non-cancer conditions. Among 3,303,992 patients from January 1981 to December 2020, there were 11,847 lung cancer cases. 5789 cases and 7240 controls were used for training, and 50,000 patients randomly selected from the whole validation population of 368,906 were used for validation. GP EHR data going back three years from the date of diagnosis, less the most recent month, were semantically pre-processed by mapping more than 30,000 terms to 450. Model building was performed using ALBERT with a Logistic Regression Classifier (LRC) head. Clustering was explored using k-means. A standalone regression model was also built on the pre-processed data as a comparator.
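To make the observation window and term mapping concrete, the following is a minimal Python sketch, not the authors' pipeline. The names `TERM_TO_CONCEPT`, `build_sequence`, the example terms, the day-count approximations of "three years" and "one month", the inclusive window boundaries, and the `"unknown"` placeholder for unmapped terms are all assumptions made for illustration.

```python
from datetime import date, timedelta

# Hypothetical mapping from raw EHR terms to coarse concepts.
# (The study maps more than 30,000 terms to 450; these entries are invented.)
TERM_TO_CONCEPT = {
    "persistent cough": "cough",
    "chronic cough": "cough",
    "chest x-ray requested": "chest_imaging",
}

LOOKBACK = timedelta(days=3 * 365)  # assumed approximation of "three years"
EXCLUSION = timedelta(days=30)      # assumed approximation of "one month"


def build_sequence(events, index_date):
    """Return the time-ordered concept sequence inside the observation window.

    events: iterable of (event_date, raw_term) pairs.
    index_date: date of diagnosis (or a reference date for a control).
    """
    start = index_date - LOOKBACK
    end = index_date - EXCLUSION
    # Keep only events inside the window (boundaries assumed inclusive).
    in_window = [(d, t) for d, t in events if start <= d <= end]
    in_window.sort(key=lambda pair: pair[0])
    # Unmapped terms collapse to a placeholder concept (an assumption).
    return [TERM_TO_CONCEPT.get(t, "unknown") for _, t in in_window]


events = [
    (date(2019, 12, 20), "chest x-ray requested"),  # in the excluded final month
    (date(2018, 5, 1), "persistent cough"),
    (date(2019, 6, 10), "chronic cough"),
    (date(2015, 1, 1), "persistent cough"),          # older than three years
]
sequence = build_sequence(events, index_date=date(2020, 1, 1))
print(sequence)
```

A sequence of this kind is the sort of input a sequence model could consume. Excluding the final month keeps events that occur just before diagnosis out of the model input.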

Findings: Our model achieved an AUROC of 0.924 (95% CI 0.921-0.927) with a PPV of 3.6% (95% CI 3.5-3.7) and a sensitivity of 86.6% (95% CI 85.3-87.8), based on three years of data prior to diagnosis, excluding the month immediately before the index diagnosis. The comparator regression model achieved a PPV of 3.1% (95% CI 3.0-3.1) and an AUROC of 0.887 (95% CI 0.884-0.889). We interpreted our model using cluster analysis and identified six groups of patients exhibiting similar lung cancer progression patterns and clinical investigation patterns.
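As a reminder of how the reported metrics relate to one another, the sketch below computes sensitivity, specificity and PPV from a confusion matrix and checks the count-based PPV against the standard Bayes relation. The counts are purely hypothetical and are not the study's data; the function names are illustrative.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Standard screening metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # share of true cases that are flagged
        "specificity": tn / (tn + fp),  # share of non-cases that are not flagged
        "ppv": tp / (tp + fp),          # share of flagged patients who are cases
        "prevalence": (tp + fn) / (tp + fp + fn + tn),
    }


def ppv_from_rates(sensitivity, specificity, prevalence):
    """PPV via Bayes' rule from sensitivity, specificity and prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


# Hypothetical counts for illustration only (not taken from the paper).
metrics = confusion_metrics(tp=90, fp=910, fn=10, tn=8990)
ppv_bayes = ppv_from_rates(
    metrics["sensitivity"], metrics["specificity"], metrics["prevalence"]
)
print(metrics["ppv"], ppv_bayes)
```

In this hypothetical population the sensitivity is 0.9 but the PPV is only 0.09, because cases make up 1% of patients. High sensitivity and a low PPV can therefore coexist when the condition is rare.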

Interpretation: Capturing temporal sequencing between cancer and non-cancer pathways to diagnosis enables much more accurate models. Future work will focus on external dataset validation and integration into GP clinical systems for evaluation.

Funding: Cancer Research UK.

Source journal

EBioMedicine (Biochemistry, Genetics and Molecular Biology: General Biochemistry, Genetics and Molecular Biology)

CiteScore: 17.70

Self-citation rate: 0.90%

Articles published: 579

Review time: 5 weeks

Journal description: eBioMedicine is a comprehensive biomedical research journal that covers a wide range of studies that are relevant to human health. Our focus is on original research that explores the fundamental factors influencing human health and disease, including the discovery of new therapeutic targets and treatments, the identification of biomarkers and diagnostic tools, and the investigation and modification of disease pathways and mechanisms. We welcome studies from any biomedical discipline that contribute to our understanding of disease and aim to improve human health.