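Language model-based early detection of colorectal cancer occurrence from previous disease diagnosis trajectory data for primary care purpose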

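The Lancet Regional Health: Western Pacific, Volume 55, Article 101381 · IF 8.1 · Zone 1 (Medicine) · JCR Q1 (Health Care Sciences & Services)
Pub Date: 2025-02-01 · Epub Date: 2025-02-17 · DOI: 10.1016/j.lanwpc.2024.101381 · https://www.sciencedirect.com/science/article/pii/S2666606524003754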

Zhiyao Luo, Jiannan Yang, Tingting Zhu, William Chi Wai Wong, Jiandong Zhou


Citations: 0

Abstract




The early prediction of colorectal cancer (CRC) in general patients is critical for improving outcomes through timely diagnosis and intervention. However, accurately predicting CRC is challenging due to the substantial variability in disease progression and the heterogeneity in patient presentations. In this study, we propose a non-invasive, accurate approach for predicting early-stage CRC using language models applied to longitudinal electronic health records (EHR) data.

Our study evaluates the effectiveness of multiple language models, including Transformer, Recurrent Neural Network (RNN), and Multi-Layer Perceptron (MLP), in predicting CRC occurrence. The models were trained in two stages: pretraining followed by fine-tuning. For model development, we used a dataset comprising the diagnostic history of 136,801 patients (45.6% male, median age 65.8 years [IQR: 52.5-76.0]) admitted between January 1, 2000, and December 31, 2003, with follow-up data through December 31, 2023. A five-fold cross-validation approach was employed, with 80% of the data used for model training and 20% reserved for evaluation. This training process was repeated five times to mitigate potential overfitting. To gain insight into the model's interpretability, we applied the GradientSHAP method to identify key disease features along the CRC risk trajectory. We further validated the model on an independent cohort of 1 million family medicine patients (46.9% male, median age at recruitment 63.8 years [IQR: 54.8-77.5]), enrolled between January 1, 2008, and December 31, 2009, with follow-up until May 31, 2024. This cohort included 4,301 patients diagnosed with CRC (63.0% male, median age at diagnosis 66.6 years [IQR: 56.3-76.3]), alongside matched controls. Notably, none of these patients were included in the model training set. For the sensitivity analysis, we employed a bootstrapping statistical test with 1,000 iterations to assess the significance of metrics such as AUROC and AUPRC.
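The evaluation steps described above (an 80%/20% five-fold split and a 1,000-iteration bootstrap on AUROC) can be illustrated with a minimal sketch. This is not the authors' code: the function names `k_fold_indices`, `auroc`, and `bootstrap_auroc_ci`, the percentile-interval choice, and the toy labels and scores are all assumptions made for illustration, and the sketch uses only the Python standard library.

```python
import random


def k_fold_indices(n, k=5, seed=0):
    """Yield (train, test) index lists; with k=5 each test fold holds about 20%."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]


def auroc(labels, scores):
    """AUROC via the rank-sum (Mann-Whitney U) identity, averaging tied ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied scores.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative")
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def bootstrap_auroc_ci(labels, scores, n_iter=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval: resample patients with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_iter:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples containing a single class
            stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[int(round(alpha / 2 * n_iter))]
    hi = stats[int(round((1 - alpha / 2) * n_iter)) - 1]
    return lo, hi


# Toy data only (hypothetical labels and risk scores, not study data).
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(auroc(toy_labels, toy_scores))  # 0.75
print(bootstrap_auroc_ci(toy_labels, toy_scores))
```

The percentile interval is one common bootstrap variant; the abstract does not say which variant the study used.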

Each architecture (Transformer, RNN, and MLP) was evaluated in three input variants: diagnosis trajectory only; demographics plus diagnosis trajectory; and demographics, diagnosis trajectory, and medication history. Among the evaluated models, the RNN demonstrated superior performance in predicting CRC when incorporating patient demographics, disease history, and medication data. The RNN achieved an area under the receiver operating characteristic curve (AUROC) of 0.887 (95% CI: 0.886-0.889) on the holdout validation cohort and 0.903 (95% CI: 0.882-0.902) on the independent validation cohort. This performance surpassed that of the Transformer and MLP models. The GradientSHAP analysis revealed that the RNN model's predictions aligned well with clinical observations. Specifically, the model identified histories of "Crohn's disease" (ICD-9: 555.x), "Ulcerative colitis" (ICD-9: 556.x), "Overweight and obesity" (ICD-9: 278), and "Diabetes" (ICD-9: 250.x) as the top predictors of CRC, which is consistent with established clinical risk factors.
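To make the ICD-9 notation above concrete, the following sketch shows one way the four reported code families could be located in a patient's diagnosis trajectory. It is illustrative only and is not the model or the GradientSHAP attribution described in the study: the names `RISK_GROUPS`, `icd9_category`, and `risk_groups_in_trajectory`, the matching on the three-digit category (so "555.x" and a bare "278" are treated the same way), and the example trajectory are all assumptions.

```python
# Three-digit ICD-9 categories for the four predictors named in the abstract.
RISK_GROUPS = {
    "555": "Crohn's disease",
    "556": "Ulcerative colitis",
    "278": "Overweight and obesity",
    "250": "Diabetes",
}


def icd9_category(code):
    """Return the part before the decimal point, e.g. '555.9' -> '555'."""
    return code.strip().split(".")[0]


def risk_groups_in_trajectory(trajectory):
    """Map each matched risk group to the date it first appears.

    trajectory: list of (iso_date, icd9_code) tuples in any order.
    ISO dates (YYYY-MM-DD) sort chronologically as plain strings.
    """
    first_seen = {}
    for date, code in sorted(trajectory):
        group = RISK_GROUPS.get(icd9_category(code))
        if group is not None and group not in first_seen:
            first_seen[group] = date
    return first_seen


# Hypothetical trajectory, not patient data from the study.
example = [
    ("2005-03-01", "250.00"),
    ("2001-06-15", "556.9"),
    ("2003-01-10", "401.9"),
    ("2006-01-01", "250.02"),
]
print(risk_groups_in_trajectory(example))
```

Flags like these are hand-built features; in the study the models take the diagnosis trajectory itself as input, so this sketch only shows how the reported codes read against raw records.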

In summary, this study demonstrates that language model-based analysis of EHR data offers a highly accurate method for the early detection of CRC in general patients. This approach holds great promise for practical application in primary care settings, potentially improving early diagnosis and patient survival rates.


Source journal

The Lancet Regional Health: Western Pacific (Medicine: Pediatrics, Perinatology and Child Health)

CiteScore: 8.80
Self-citation rate: 2.80%
Articles published: 305
Review time: 11 weeks

About the journal: The Lancet Regional Health – Western Pacific, a gold open access journal, is an integral part of The Lancet's global initiative advocating for healthcare quality and access worldwide. It aims to advance clinical practice and health policy in the Western Pacific region, contributing to enhanced health outcomes. The journal publishes high-quality original research shedding light on clinical practice and health policy in the region. It also includes reviews, commentaries, and opinion pieces covering diverse regional health topics, such as infectious diseases, non-communicable diseases, child and adolescent health, maternal and reproductive health, aging health, mental health, the health workforce and systems, and health policy.