Novel machine learning algorithm in risk prediction model for pan-cancer risk: application in a large prospective cohort

BMJ Oncology, Pub Date: 2024-07-01, DOI: 10.1136/bmjonc-2023-000087

Xifeng Wu, Huakang Tu, Qingfeng Hu, Shan-Pou Tsai, David Ta-Wei Chu, C. Wen



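Abstract

The aim was to develop and validate machine-learning models that predict the risk of pan-cancer incidence from demographic, questionnaire and routine health check-up data in a large Asian population. This is a prospective cohort study of 433 549 participants from the prospective MJ cohort, comprising a male cohort (n=208 599) and a female cohort (n=224 950). During a median follow-up of 8 years, 5143 cancers occurred in males and 4764 in females. Compared with Lasso-Cox and Random Survival Forests, XGBoost performed better in both cohorts. The XGBoost model using all 155 features in males and all 160 features in females achieved an area under the curve (AUC) of 0.877 and 0.750, respectively. Light models with 31 variables for males and 11 variables for females performed comparably. In the male cohort, the AUC was 0.876 (95% CI 0.858 to 0.894) in the overall population and 0.818 (95% CI 0.795 to 0.841) in those aged ≥40 years. In the female cohort, the AUC was 0.746 (95% CI 0.721 to 0.771) in the overall population and 0.641 (95% CI 0.605 to 0.677) in those aged ≥40 years. High-risk individuals have at least a ninefold higher risk of pan-cancer incidence than low-risk groups. We developed and internally validated the first machine-learning models based on routine health check-up data to predict pan-cancer risk in the general population, and achieved generally good discriminatory ability with a small set of predictors. External validation is warranted before the risk model is implemented in clinical practice.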
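For readers unfamiliar with how a figure such as "AUC 0.876 (95% CI 0.858 to 0.894)" can be produced, the following is a minimal sketch, not the authors' method. It assumes a binary outcome (cancer or no cancer) and ignores censoring and follow-up time, which a survival analysis would normally handle with a time-dependent AUC or concordance index. The percentile bootstrap is likewise an assumption, since the abstract does not say how the intervals were computed. The function names `auc` and `bootstrap_ci` are illustrative.

```python
import random


def auc(scores, labels):
    """Chance that a random case scores above a random non-case (ties count 0.5)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one case and one non-case")
    # Rank-based (Mann-Whitney U) form, using average ranks for tied scores.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def bootstrap_ci(scores, labels, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC (assumed procedure, see lead-in)."""
    n = len(scores)
    if not 0 < sum(labels) < n:
        raise ValueError("need both cases and non-cases")
    rng = random.Random(seed)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain only one class
            stats.append(auc([scores[i] for i in idx], ys))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

Calling `auc(predicted_risk, observed_outcome)` gives the point estimate and `bootstrap_ci` with the same arguments gives the interval. With a cohort of this size, a vectorised implementation would be preferable to this pure-Python version.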


Novel machine learning algorithm in risk prediction model for pan-cancer risk: application in a large prospective cohort

To develop and validate machine-learning models that predict the risk of pan-cancer incidence using demographic, questionnaire and routine health check-up data in a large Asian population. This study is a prospective cohort study including 433 549 participants from the prospective MJ cohort including a male cohort (n=208 599) and a female cohort (n=224 950). During an 8-year median follow-up, 5143 cancers occurred in males and 4764 in females. Compared with Lasso-Cox and Random Survival Forests, XGBoost showed superior performance for both cohorts. The XGBoost model with all 155 features in males and 160 features in females achieved an area under the curve (AUC) of 0.877 and 0.750, respectively. Light models with 31 variables for males and 11 variables for females showed comparable performance: an AUC of 0.876 (95% CI 0.858 to 0.894) in the overall population and 0.818 (95% CI 0.795 to 0.841) in those aged ≥40 years in the male cohort and an AUC of 0.746 (95% CI 0.721 to 0.771) in the overall population and 0.641 (95% CI 0.605 to 0.677) in those aged ≥40 years in the female cohort. High-risk individuals have at least ninefold higher risk of pan-cancer incidence compared with low-risk groups. We developed and internally validated the first machine-learning models based on routine health check-up data to predict pan-cancer risk in the general population and achieved generally good discriminatory ability with a small set of predictors. External validation is warranted before the implementation of our risk model in clinical practice.
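The "at least ninefold higher risk" comparison can be illustrated with a crude incidence ratio between a high-risk and a low-risk group. This is only a sketch under stated assumptions. The abstract does not say how the groups were defined, so the quantile cut-offs `low_q` and `high_q` are hypothetical. The calculation also uses simple incidence proportions and ignores follow-up time and censoring, unlike a hazard ratio from a survival model.

```python
def risk_ratio_by_group(scores, events, low_q=0.5, high_q=0.9):
    """Crude incidence ratio: scores at or above the high_q quantile
    versus scores below the low_q quantile (cut-offs are hypothetical)."""
    if not scores or len(scores) != len(events):
        raise ValueError("scores and events must be non-empty and equal in length")
    ranked = sorted(scores)
    n = len(ranked)
    low_cut = ranked[min(int(low_q * n), n - 1)]
    high_cut = ranked[min(int(high_q * n), n - 1)]
    # events holds 1 for an incident cancer and 0 otherwise
    low = [e for s, e in zip(scores, events) if s < low_cut]
    high = [e for s, e in zip(scores, events) if s >= high_cut]
    if not low or not high or sum(low) == 0:
        raise ValueError("groups too small or no events in the low-risk group")
    return (sum(high) / len(high)) / (sum(low) / len(low))
```

A ratio of 9 or more from such a calculation would correspond to the kind of separation the abstract describes, although the published estimate may have been derived differently.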

