Zhiyao Luo, Jiannan Yang, Tingting Zhu, William Chi Wai Wong, Jiandong Zhou
The Lancet Regional Health: Western Pacific, volume 55, Article 101381. Published 2025-02-01 (Epub 2025-02-17). DOI: 10.1016/j.lanwpc.2024.101381. https://www.sciencedirect.com/science/article/pii/S2666606524003754
Language model-based early detection of colorectal cancer occurrence from previous disease diagnosis trajectory data for primary care purpose
Early prediction of colorectal cancer (CRC) in the general patient population is critical for improving outcomes through timely diagnosis and intervention. However, accurately predicting CRC is challenging because disease progression varies substantially and patient presentations are heterogeneous. In this study, we propose a non-invasive, accurate approach for predicting early-stage CRC using language models applied to longitudinal electronic health record (EHR) data.
Our study evaluates the effectiveness of multiple language models, including Transformer, Recurrent Neural Network (RNN), and Multi-Layer Perceptron (MLP), in predicting CRC occurrence. The models were trained using a two-stage process of pretraining and fine-tuning. For model development, we used a dataset comprising the diagnostic history of 136,801 patients (45.6% male, median age 65.8 years [IQR: 52.5-76.0]) admitted between January 1, 2000, and December 31, 2003, with follow-up data through December 31, 2023. A five-fold cross-validation approach was employed, with 80% of the data used for model training and 20% reserved for evaluation. This training process was repeated five times to mitigate potential overfitting. To interpret the model’s predictions, we applied the GradientSHAP method to identify key disease features along the CRC risk trajectory. We further validated the model on an independent cohort of 1 million family medicine patients (46.9% male, median age at recruitment 63.8 years [IQR: 54.8-77.5]), enrolled between January 1, 2008, and December 31, 2009, with follow-up until May 31, 2024. This cohort included 4,301 patients diagnosed with CRC (63.0% male, median age at diagnosis 66.6 years [IQR: 56.3-76.3]), alongside matched controls. Notably, none of these patients were included in the model training set. For the sensitivity analysis, we employed a bootstrapping statistical test with 1,000 iterations to assess metrics such as AUROC and AUPRC for significance.
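The bootstrapping step described above can be illustrated with a minimal sketch. It is not the authors' code: the function names `auroc` and `bootstrap_auroc_ci`, the percentile method for the interval, and the fixed random seed are all assumptions made for illustration. AUROC is computed from the rank-sum identity so that the example needs only the Python standard library.

```python
import random


def auroc(labels, scores):
    """AUROC via the rank-sum (Mann-Whitney U) identity; ties get average ranks."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative label")
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied scores.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def bootstrap_auroc_ci(labels, scores, n_iter=1000, alpha=0.05, seed=0):
    """Point estimate plus a percentile bootstrap interval (assumed method)."""
    point = auroc(labels, scores)  # also validates that both classes are present
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_iter:
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain a single class
            stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_iter)]
    upper = stats[int((1 - alpha / 2) * n_iter) - 1]
    return point, lower, upper


# Toy data only; these are not values from the study.
toy_labels = [0, 0, 1, 1, 0, 1, 0, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.5, 0.9]
point, lower, upper = bootstrap_auroc_ci(toy_labels, toy_scores, n_iter=200)
```

With 1,000 iterations and `alpha=0.05`, the interval is read off the sorted resampled AUROC values at roughly the 2.5th and 97.5th percentiles. The same resampling loop would apply to AUPRC by swapping the metric function.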
Each architecture (Transformer, RNN, and MLP) was evaluated with three input variants: diagnosis trajectory only; demographics plus diagnosis trajectory; and demographics, diagnosis trajectory, and medication history. The RNN performed best in predicting CRC when it incorporated patient demographics, disease history, and medication data. It achieved an area under the receiver operating characteristic curve (AUROC) of 0.887 (95% CI: 0.886-0.889) on the holdout validation cohort and 0.903 (95% CI: 0.882-0.902) on the independent validation cohort, surpassing the Transformer and MLP models. The GradientSHAP analysis revealed that the RNN model’s predictions aligned well with clinical observations. Specifically, the model identified histories of “Crohn’s disease” (ICD-9: 555.x), “Ulcerative colitis” (ICD-9: 556.x), “Overweight and obesity” (ICD-9: 278), and “Diabetes” (ICD-9: 250.x) as the top predictors of CRC, consistent with established clinical risk factors.
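To make the ICD-9 notation above concrete, the following sketch flags the four named code groups in a patient's diagnosis trajectory. It does not implement GradientSHAP or any part of the authors' pipeline. The names `RISK_FACTOR_PATTERNS`, `matches`, and `flag_risk_factors` are hypothetical, and treating both "555.x" and a bare "278" as "any code in that three-digit category" is an assumption about how the notation is meant.

```python
# Hypothetical lookup built from the four code groups named in the abstract.
RISK_FACTOR_PATTERNS = {
    "Crohn's disease": "555.x",
    "Ulcerative colitis": "556.x",
    "Overweight and obesity": "278",
    "Diabetes": "250.x",
}


def matches(code, pattern):
    """True if an ICD-9 code falls in the pattern's three-digit category.

    Assumption: only the part before the decimal point is compared, so
    "555.x" matches "555", "555.0", "555.9", and so on.
    """
    category = pattern.split(".")[0]
    return code.strip().split(".")[0] == category


def flag_risk_factors(trajectory):
    """Return the sorted names of risk-factor groups present in a trajectory."""
    found = set()
    for code in trajectory:
        for name, pattern in RISK_FACTOR_PATTERNS.items():
            if matches(code, pattern):
                found.add(name)
    return sorted(found)


# Toy trajectory; 401.1 (hypertension) is not one of the four groups.
flags = flag_risk_factors(["555.9", "401.1", "250.00"])
```

A lookup like this only recovers whether a code group appears in the history. Attribution scores such as those from GradientSHAP additionally require the trained model and a baseline input.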
In summary, this study demonstrates that language model-based analysis of EHR data offers a highly accurate method for the early detection of CRC in the general patient population. This approach holds promise for practical application in primary care settings, potentially improving early diagnosis and patient survival rates.
About the journal:
The Lancet Regional Health – Western Pacific, a gold open access journal, is an integral part of The Lancet's global initiative advocating for healthcare quality and access worldwide. It aims to advance clinical practice and health policy in the Western Pacific region, contributing to enhanced health outcomes. The journal publishes high-quality original research shedding light on clinical practice and health policy in the region. It also includes reviews, commentaries, and opinion pieces covering diverse regional health topics, such as infectious diseases, non-communicable diseases, child and adolescent health, maternal and reproductive health, aging health, mental health, the health workforce and systems, and health policy.