A web-based tool for cancer risk prediction for middle-aged and elderly adults using machine learning algorithms and self-reported questions

Impact Factor: 3.3 · Zone 3 (Medicine) · JCR Q1 (Public, Environmental & Occupational Health) · Annals of Epidemiology · Pub Date: 2025-01-01 · DOI: 10.1016/j.annepidem.2024.12.003
Xingjian Xiao, Xiaohan Yi, Nyi Nyi Soe, Phyu Mon Latt, Luotao Lin, Xuefen Chen, Hualing Song, Bo Sun, Hailei Zhao, Xianglong Xu
Annals of Epidemiology, Volume 101, Pages 27-35. Publication type: Journal Article. Source: https://www.sciencedirect.com/science/article/pii/S104727972400276X
Citations: 0

Abstract

Background

Globally, China is among the countries with high cancer incidence and mortality rates.

Objective

Our objective is to create an online cancer risk prediction tool for middle-aged and elderly Chinese adults by leveraging machine learning algorithms and self-reported data.

Method

Drawing on a cohort of 19,798 participants aged 45 and above from the China Health and Retirement Longitudinal Study (2011-2018), we employed nine machine learning algorithms to construct predictive models for various cancers: Logistic Regression (LR), Adaptive Boosting (AdaBoost), Support Vector Machine (SVM), Random Forest (RF), Gaussian Naive Bayes (GNB), Gradient Boosting Machine (GBM), Light Gradient Boosting Machine (LGBM), eXtreme Gradient Boosting (XGBoost), and K-Nearest Neighbors (KNN). These algorithms are mainly used for classification and regression tasks. Using non-invasive self-reported predictors covering demographic, educational, marital, lifestyle, health history, and other factors, we focused on predicting "Cancer or Malignant Tumour" outcomes. The cancer types that can be predicted mainly include lung, breast, cervical, colorectal, gastric, and esophageal cancer, as well as other rare cancers.
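To make one of the nine listed algorithms concrete, the following is a minimal sketch of K-Nearest Neighbors classification on self-reported features. It is not the authors' implementation: the function name `knn_predict`, the three features, and every data row are hypothetical and invented purely for illustration.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training rows."""
    # Euclidean distance from the query to every training row.
    distances = [
        (math.dist(row, query), label)
        for row, label in zip(train_X, train_y)
    ]
    # Keep the k closest rows and count their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical, made-up rows (NOT study data):
# (age in decades, self-rated health from 1 = good to 5 = poor, smoker 0/1).
train_X = [
    (4.5, 1, 0), (5.0, 2, 0), (5.5, 1, 0), (6.0, 2, 0),
    (7.0, 4, 1), (7.5, 5, 1), (8.0, 4, 1),
]
# Hypothetical outcome labels: 1 = reported cancer, 0 = did not.
train_y = [0, 0, 0, 0, 1, 1, 1]

low_risk_like = knn_predict(train_X, train_y, (5.2, 1, 0), k=3)
high_risk_like = knn_predict(train_X, train_y, (7.8, 5, 1), k=3)
print(low_risk_like, high_risk_like)
```

In practice the features would need to be scaled to comparable ranges before computing distances, and the value of `k` would be tuned on held-out data; both steps are omitted here to keep the sketch short.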

Results

The developed tool, MyCancerRisk, performed well: using self-reported variables, the Random Forest algorithm achieved an area under the curve (AUC) of 0.75 and an accuracy (ACC) of 0.99. Key predictors identified include age, self-rated health, sleep patterns, household heating sources, childhood health status, living conditions, and smoking habits.
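AUC and accuracy measure different things, and the two can diverge when the outcome is rare. The sketch below computes both from scratch on a synthetic example. Everything in it is an assumption made for illustration: the abstract does not report outcome prevalence or a decision threshold, so the 2% prevalence, the scores, and the 0.5 threshold are invented, and the numbers say nothing about the study's actual model.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Synthetic data (NOT study data): 100 people, 2 with the outcome.
y_true = [1, 1] + [0] * 98
# Invented risk scores: one positive ranks above everyone, one ranks mid-pack.
scores = [0.40, 0.10] + [0.05] * 49 + [0.20] * 49

# Assumed decision threshold of 0.5: no score reaches it, so nobody is flagged.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

acc = accuracy(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(f"accuracy={acc:.2f} auc={auc:.2f}")
```

On this synthetic data the classifier flags no one, yet accuracy is high because 98 of 100 people are true negatives, while AUC reflects only how well the scores rank positives above negatives. For rare outcomes it is therefore common to read accuracy alongside ranking and sensitivity-based metrics rather than on its own.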

Conclusion

MyCancerRisk aims to serve as a preventative screening tool, encouraging individuals to undergo testing and adopt healthier behaviours to mitigate the public health impact of cancer. Our study also sheds light on unconventional predictors, such as housing conditions, offering valuable insights for refining cancer prediction models.
Source journal: Annals of Epidemiology
Category: Medicine - Public, Environmental & Occupational Health
CiteScore: 7.40
Self-citation rate: 1.80%
Articles published: 207
Review time: 59 days
Journal description: The journal emphasizes the application of epidemiologic methods to issues that affect the distribution and determinants of human illness in diverse contexts. Its primary focus is on chronic and acute conditions of diverse etiologies and of major importance to clinical medicine, public health, and health care delivery.