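Language model-based early detection of colorectal cancer occurrence from previous disease diagnosis trajectory data for primary care purpose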

The Lancet Regional Health: Western Pacific (IF 8.1; Zone 1, Medicine; JCR Q1, Health Care Sciences & Services). Pub Date: 2025-02-01. Epub Date: 2025-02-17. DOI: 10.1016/j.lanwpc.2024.101381
Zhiyao Luo, Jiannan Yang, Tingting Zhu, William Chi Wai Wong, Jiandong Zhou
Volume 55, Article 101381. Full text: https://www.sciencedirect.com/science/article/pii/S2666606524003754
Citations: 0

Abstract

Language model-based early detection of colorectal cancer occurrence from previous disease diagnosis trajectory data for primary care purpose
The early prediction of colorectal cancer (CRC) in general patients is critical for improving outcomes through timely diagnosis and intervention. However, accurately predicting CRC is challenging due to the substantial variability in disease progression and the heterogeneity in patient presentations. In this study, we propose a non-invasive, accurate approach for predicting early-stage CRC using language models applied to longitudinal electronic health records (EHR) data.
Our study evaluates the effectiveness of multiple language models, including Transformer, Recurrent Neural Network (RNN), and Multi-Layer Perceptron (MLP), in predicting CRC occurrence. The models were trained using a two-stage process of pretraining and fine-tuning. For model development, we used a dataset comprising the diagnostic history of 136,801 patients (45.6% male, median age 65.8 years [IQR: 52.5-76.0]) admitted between January 1, 2000, and December 31, 2003, with follow-up data through December 31, 2023. A five-fold cross-validation approach was employed, with 80% of the data used for model training and 20% reserved for evaluation. This training process was repeated five times to mitigate potential overfitting. To gain insight into the model’s interpretability, we applied the GradientSHAP method to identify key disease features along the CRC risk trajectory. We further validated the model on an independent cohort of 1 million family medicine patients (46.9% male, median age at recruitment 63.8 years [IQR: 54.8-77.5]), enrolled between January 1, 2008, and December 31, 2009, with follow-up until May 31, 2024. This cohort included 4,301 patients diagnosed with CRC (63.0% male, median age at diagnosis 66.6 years [IQR: 56.3-76.3]), alongside matched controls. Notably, none of these patients were included in the model training set. For the sensitivity analysis, we employed a bootstrap test with 1,000 iterations to assess the statistical significance of metrics such as AUROC and the area under the precision-recall curve (AUPRC).
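The bootstrap evaluation described above can be illustrated with a small sketch. The abstract does not say how the bootstrap was implemented, so this assumes a percentile bootstrap over resampled patients. The function names `auroc` and `bootstrap_auroc_ci`, the default seed, and the synthetic data are all hypothetical and are not taken from the study.

```python
import numpy as np


def auroc(y_true, y_score):
    """AUROC from the Mann-Whitney U statistic, using average ranks for tied scores."""
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    n_pos = int(y_true.sum())
    n_neg = y_true.size - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative label")
    order = np.argsort(y_score, kind="mergesort")
    sorted_scores = y_score[order]
    # Each group of tied scores shares the mean of its 1-based rank positions.
    _, first_idx, counts = np.unique(
        sorted_scores, return_index=True, return_counts=True
    )
    avg_ranks = first_idx + (counts + 1) / 2.0
    ranks = np.empty(y_score.size, dtype=float)
    ranks[order] = np.repeat(avg_ranks, counts)
    u_stat = ranks[y_true].sum() - n_pos * (n_pos + 1) / 2.0
    return float(u_stat / (n_pos * n_neg))


def bootstrap_auroc_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """Return (point estimate, lower bound, upper bound) from a percentile bootstrap.

    Assumption: patients are resampled with replacement, and resamples that
    contain only one class are redrawn.
    """
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    point = auroc(y_true, y_score)  # also validates that both classes exist
    rng = np.random.default_rng(seed)
    n = y_true.size
    stats = []
    while len(stats) < n_boot:
        idx = rng.integers(0, n, size=n)
        sample = y_true[idx]
        if sample.all() or not sample.any():
            continue  # AUROC is undefined without both classes
        stats.append(auroc(sample, y_score[idx]))
    lower, upper = np.quantile(stats, [alpha / 2.0, 1.0 - alpha / 2.0])
    return point, float(lower), float(upper)


if __name__ == "__main__":
    # Synthetic labels and scores for illustration only; not study data.
    rng = np.random.default_rng(42)
    labels = rng.integers(0, 2, size=500)
    scores = labels * 0.8 + rng.normal(0.0, 1.0, size=500)
    print(bootstrap_auroc_ci(labels, scores, n_boot=1000))
```

The rank-based AUROC keeps the sketch dependent only on NumPy. In practice a library routine such as scikit-learn's `roc_auc_score` could take the place of `auroc`.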
Among the evaluated models (Transformer, RNN, and MLP, each tested with three input variants: diagnosis trajectory only; demographics plus diagnosis trajectory; and demographics, diagnosis trajectory, and medication history), the RNN performed best at predicting CRC when given patient demographics, disease history, and medication data. The RNN achieved an area under the receiver operating characteristic curve (AUROC) of 0.887 (95% CI: 0.886-0.889) on the holdout validation cohort and 0.903 (95% CI: 0.882-0.902) on the independent validation cohort, surpassing the Transformer and MLP models. The GradientSHAP analysis showed that the RNN model’s predictions aligned well with clinical observations. Specifically, the model identified histories of “Crohn’s disease” (ICD-9: 555.x), “Ulcerative colitis” (ICD-9: 556.x), “Overweight and obesity” (ICD-9: 278), and “Diabetes” (ICD-9: 250.x) as the top predictors of CRC, which is consistent with established clinical risk factors.
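To make the ICD-9 notation above concrete, the sketch below checks a patient's diagnosis trajectory against the four code families named in the abstract. It is a plain lookup for illustration only, not the study's RNN or its GradientSHAP attribution. The names `RISK_FACTOR_CATEGORIES`, `icd9_category`, and `flag_risk_factors` are hypothetical, and the codes are assumed to be dotted strings such as "250.00".

```python
from typing import Dict, Iterable, List

# Three-digit ICD-9 categories and labels as listed in the abstract.
# The dictionary name and layout are illustrative assumptions.
RISK_FACTOR_CATEGORIES: Dict[str, str] = {
    "555": "Crohn's disease",
    "556": "Ulcerative colitis",
    "278": "Overweight and obesity",
    "250": "Diabetes",
}


def icd9_category(code: str) -> str:
    """Return the part of a dotted ICD-9 code before the decimal point."""
    return code.strip().split(".")[0]


def flag_risk_factors(trajectory: Iterable[str]) -> List[str]:
    """List the named risk factors found in a trajectory, in first-seen order."""
    found: List[str] = []
    for code in trajectory:
        label = RISK_FACTOR_CATEGORIES.get(icd9_category(code))
        if label is not None and label not in found:
            found.append(label)
    return found


if __name__ == "__main__":
    # Hypothetical trajectory for illustration; not patient data.
    print(flag_risk_factors(["401.9", "250.00", "556.9", "250.02"]))
```

A trailing ".x" in the abstract stands for any subcode, which is why the lookup compares only the category before the decimal point.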
In summary, this study demonstrates that language model-based analysis of EHR data offers a highly accurate method for the early detection of CRC in general patients. This approach holds great promise for practical application in primary care settings, potentially improving early diagnosis and patient survival rates.
Source journal
The Lancet Regional Health: Western Pacific
Subject area: Medicine - Pediatrics, Perinatology and Child Health
CiteScore: 8.80
Self-citation rate: 2.80%
Articles published: 305
Review time: 11 weeks
About the journal: The Lancet Regional Health – Western Pacific, a gold open access journal, is an integral part of The Lancet's global initiative advocating for healthcare quality and access worldwide. It aims to advance clinical practice and health policy in the Western Pacific region, contributing to enhanced health outcomes. The journal publishes high-quality original research shedding light on clinical practice and health policy in the region. It also includes reviews, commentaries, and opinion pieces covering diverse regional health topics, such as infectious diseases, non-communicable diseases, child and adolescent health, maternal and reproductive health, aging health, mental health, the health workforce and systems, and health policy.