An AI-driven Predictive Model for Pancreatic Cancer Patients Using Extreme Gradient Boosting

Aditya Chakraborty, Chris P. Tsokos
Journal of Statistical Theory and Applications. Published 2023-09-11. DOI: 10.1007/s44199-023-00063-7 (https://doi.org/10.1007/s44199-023-00063-7)

Abstract

Pancreatic cancer is one of the deadliest cancers worldwide. Most patients are diagnosed at Stage III or Stage IV, when the chances of survival are very low. This study builds a data-driven predictive model of the survival times of patients diagnosed with pancreatic cancer, based on the associated risk factors, and identifies the factors that contribute most to those survival times, using the XGBoost (eXtreme Gradient Boosting) algorithm. A grid search was used to find the optimum values of the model's hyperparameters by minimizing the root mean square error (RMSE); the final hyperparameters were selected by comparison with 243 competing models. To check the validity of the model, we compared its performance with ten deep neural network models, grown sequentially with different activation functions and optimizers. We also constructed an ensemble model using a Gradient Boosting Machine (GBM). The proposed XGBoost model outperformed all competing models we considered with regard to RMSE. After developing the model, we ranked the individual risk factors according to their contribution to the response predictions, which can help pancreatic cancer research organizations direct their resources toward the risk factors that most influence this type of cancer. The three most influential risk factors affecting the survival of pancreatic cancer patients were found to be the age of the patient, current BMI, and cigarette smoking years, with contributions of 35.5%, 24.3%, and 14.93%, respectively. The predictive model is approximately 96.42% accurate in predicting the survival times of patients diagnosed with pancreatic cancer and performs very well on test data.
The same analytical methodology can be used for prediction more generally: given a set of numeric and non-numeric features, it can be used to predict the time to death related to a specific type of cancer.
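The grid-search step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state which hyperparameters or values were searched, so the grid below (three hypothetical values for each of five common XGBoost hyperparameters, which happens to give 3^5 = 243 candidates) is an assumption, as are the helper names `rmse`, `expand_grid`, `grid_search`, and `fit_and_score`. The sketch only shows the mechanics: enumerate every combination, score each by RMSE, and keep the lowest.

```python
import itertools
import math
from typing import Callable, Dict, List, Sequence, Tuple

Params = Dict[str, float]


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean square error between observed and predicted values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


# ASSUMPTION: hypothetical grid, not taken from the paper.
# 3 candidate values for each of 5 hyperparameters gives 3**5 = 243 models.
PARAM_GRID: Dict[str, List[float]] = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.05, 0.1],
    "n_estimators": [100, 300, 500],
    "subsample": [0.6, 0.8, 1.0],
    "colsample_bytree": [0.6, 0.8, 1.0],
}


def expand_grid(grid: Dict[str, List[float]]) -> List[Params]:
    """Return every combination of hyperparameter values in the grid."""
    names = list(grid)
    combos = itertools.product(*(grid[name] for name in names))
    return [dict(zip(names, values)) for values in combos]


def grid_search(
    grid: Dict[str, List[float]],
    fit_and_score: Callable[[Params], float],
) -> Tuple[Params, float]:
    """Evaluate each candidate and return the one with the lowest score.

    fit_and_score is a caller-supplied (hypothetical) function that trains
    a model with the given hyperparameters and returns its validation RMSE.
    """
    candidates = expand_grid(grid)
    if not candidates:
        raise ValueError("the grid produced no candidates")
    best_params, best_score = candidates[0], math.inf
    for params in candidates:
        score = fit_and_score(params)
        if score < best_score:  # strict '<' keeps the first of any ties
            best_params, best_score = params, score
    return best_params, best_score
```

In practice, `fit_and_score` might train `xgboost.XGBRegressor(**params)` on a training split and return `rmse` of its predictions on a held-out split; that choice, like the grid itself, is an assumption for illustration rather than something the abstract specifies.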