Optimization of multidimensional feature engineering and data partitioning strategies in heart disease prediction models

Alexandria Engineering Journal · Impact factor 6.2 · Zone 2 (Engineering and Technology) · JCR Q1 (Engineering, Multidisciplinary) · Pub Date: 2024-09-18 · DOI: 10.1016/j.aej.2024.09.037
Citations: 0

Abstract

Heart disease is a leading cause of death worldwide, and its steadily rising incidence presents a significant public health challenge. Accurate prediction of heart disease risk and early intervention are therefore crucial. This study investigates how to improve the performance of heart disease prediction models built with machine learning and deep learning algorithms. We used the Heart Failure Prediction Dataset from Kaggle. After preprocessing to ensure data quality, we applied three feature engineering techniques and assessed their impact on model performance: PCA for dimensionality reduction, and ET and Pearson's correlation coefficient for feature selection. The dataset was then partitioned at three test-to-training ratios (1:9, 2:8, and 3:7) to determine the effect of the split on model performance. Twelve machine learning classifiers (LGBM, AdaBoost, XGB, RF, DT, KNN, LR, GNB, ET, SVC, GB, and Bagging) were trained and evaluated on five metrics: accuracy, recall, precision, F1 score, and training time. The influence of the feature engineering methods and split ratios on model performance was analyzed systematically using paired-sample t-tests. Among the combinations compared, the Bagging classifier with ET-based feature selection performed best: it achieved an accuracy of 97.48% and an F1 score of 97.48% at a 1:9 test-to-training split, 94.96% and 94.95% at 2:8, and 94.12% and 94.11% at 3:7. The paired-sample t-tests indicate that feature selection using the Pearson correlation coefficient shortens training time but also lowers classifier performance. PCA dimensionality reduction produced no significant difference from the control group in either training efficiency or classifier performance. Feature selection through ET, however, significantly reduced training time across classifiers while preserving their performance.
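
The workflow the abstract describes can be sketched roughly as follows. This is an illustrative sketch only, not the authors' code: it assumes scikit-learn and SciPy, reads "ET" as Extra Trees, uses synthetic data from `make_classification` in place of the Kaggle dataset, evaluates only the Bagging classifier, and pairs the t-test across split ratios. The function names, the number of retained features `k`, and every hyperparameter are hypothetical choices made for the example.

```python
import time

import numpy as np
from scipy.stats import ttest_rel
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import BaggingClassifier, ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

TEST_SIZES = (0.1, 0.2, 0.3)  # test:train ratios of 1:9, 2:8 and 3:7
METHODS = ("none", "pca", "et", "pearson")  # "none" is the control group


def fit_transformer(method, X_train, y_train, k=6):
    """Fit one feature-engineering step on the training data only."""
    if method == "none":
        return lambda X: X
    if method == "pca":
        return PCA(n_components=k, random_state=0).fit(X_train).transform
    if method == "et":
        # Keep features whose Extra Trees importance is at least the median.
        forest = ExtraTreesClassifier(n_estimators=100, random_state=0)
        selector = SelectFromModel(forest, threshold="median")
        return selector.fit(X_train, y_train).transform
    if method == "pearson":
        # Keep the k features with the largest |Pearson r| against the label.
        r = np.array([np.corrcoef(X_train[:, j], y_train)[0, 1]
                      for j in range(X_train.shape[1])])
        keep = np.argsort(-np.abs(r))[:k]
        return lambda X: X[:, keep]
    raise ValueError(f"unknown method: {method}")


def run_experiment():
    """Return metrics for every (method, test_size) combination."""
    # Synthetic stand-in data; sizes are arbitrary, not the paper's.
    X, y = make_classification(n_samples=600, n_features=12,
                               n_informative=6, random_state=0)
    results = {}
    for test_size in TEST_SIZES:
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, stratify=y, random_state=0)
        scaler = StandardScaler().fit(X_tr)
        X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)
        for method in METHODS:
            transform = fit_transformer(method, X_tr, y_tr)
            clf = BaggingClassifier(n_estimators=20, random_state=0)
            start = time.perf_counter()
            clf.fit(transform(X_tr), y_tr)
            train_time = time.perf_counter() - start
            pred = clf.predict(transform(X_te))
            results[(method, test_size)] = {
                "accuracy": accuracy_score(y_te, pred),
                "f1": f1_score(y_te, pred),
                "train_time": train_time,
            }
    return results


results = run_experiment()
for (method, test_size), row in results.items():
    print(f"{method:8s} test={test_size:.1f} acc={row['accuracy']:.3f} "
          f"f1={row['f1']:.3f} train={row['train_time']:.3f}s")

# Paired-sample t-test: control vs. ET selection, paired by split ratio.
control = [results[("none", s)]["train_time"] for s in TEST_SIZES]
et_selected = [results[("et", s)]["train_time"] for s in TEST_SIZES]
p_value = ttest_rel(control, et_selected).pvalue
print(f"paired t-test on training time (none vs. et): p={p_value:.3f}")
```

Each transformer is fitted on the training portion only, so the test set does not leak into feature selection. With just three pairs the t-test here has very little power; pairing across all twelve classifiers, as the abstract implies, would give a more meaningful comparison.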

Source journal: Alexandria Engineering Journal (Engineering, General Engineering)
CiteScore: 11.20
Self-citation rate: 4.40%
Articles published: 1015
Review time: 43 days
About the journal: Alexandria Engineering Journal is an international journal devoted to publishing high-quality papers in the field of engineering and applied science. It is cited in the Engineering Information Services (EIS) and the Chemical Abstracts (CA). Papers published in the journal are grouped into five sections:
• Mechanical, Production, Marine and Textile Engineering
• Electrical Engineering, Computer Science and Nuclear Engineering
• Civil and Architecture Engineering
• Chemical Engineering and Applied Sciences
• Environmental Engineering