Title: Optimization of multidimensional feature engineering and data partitioning strategies in heart disease prediction models
Journal: Alexandria Engineering Journal (Q1, Engineering, Multidisciplinary; Impact Factor 6.2)
DOI: 10.1016/j.aej.2024.09.037
Published: 2024-09-18 (Journal Article)
URL: https://www.sciencedirect.com/science/article/pii/S1110016824010603
Citations: 0
Abstract
The relentless rise in heart disease incidence, a leading global cause of death, presents a significant public health challenge, making precise prediction of heart disease risk and early intervention crucial. This study investigates performance improvements in heart disease prediction models built with machine learning and deep learning algorithms. We used the Heart Failure Prediction Dataset from Kaggle and, after preprocessing to ensure data quality, applied three distinct feature engineering techniques: principal component analysis (PCA) for dimensionality reduction, Extra Trees (ET) for feature selection, and Pearson's correlation coefficient for feature selection, assessing the impact of each on model performance. The dataset was then partitioned at three test-to-training split ratios (1:9, 2:8, and 3:7) to determine their specific effects on model performance. Twelve machine learning classifiers (LGBM, AdaBoost, XGB, RF, DT, KNN, LR, GNB, ET, SVC, GB, and Bagging) were trained and evaluated on five key metrics: accuracy, recall, precision, F1 score, and training time. The influence of the different feature engineering methods and data partitioning ratios on model performance was systematically analyzed using paired-sample t-tests. Among the configurations compared, the Bagging classifier combined with ET-based feature selection performed best, achieving 97.48% accuracy and a 97.48% F1 score at a 1:9 test-to-training split, 94.96% accuracy and a 94.95% F1 score at 2:8, and 94.12% accuracy and a 94.11% F1 score at 3:7. Paired-sample t-test results indicate that feature selection using the Pearson correlation coefficient shortens training time but also degrades classifier performance.
After PCA dimensionality reduction, there was no significant difference in training efficiency or classifier efficacy compared with the control group. Feature selection through ET, however, significantly reduced training time across the various classifiers while preserving their performance.
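The abstract's best-performing configuration (ET-based feature selection feeding a Bagging classifier at a 1:9 test-to-training split, compared against a baseline via a paired-sample t-test) can be sketched as follows. This is a minimal illustration using scikit-learn and scipy, not the authors' implementation: a synthetic dataset stands in for the Kaggle Heart Failure Prediction data (11 predictors, binary target), and all hyperparameters are illustrative assumptions.

```python
# Sketch: Extra-Trees feature selection + Bagging classifier, evaluated at a
# 1:9 test:train split, with a paired-sample t-test against a no-selection
# baseline. Synthetic data stands in for the Kaggle dataset.
from scipy.stats import ttest_rel
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline

# Stand-in for the preprocessed heart disease data (~900 rows, 11 features)
X, y = make_classification(n_samples=900, n_features=11, n_informative=6,
                           random_state=42)

# 1:9 split between test and training sets (10% held out for testing)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1,
                                          stratify=y, random_state=42)

# ET feature selection (keep features above mean importance), then Bagging
et_bagging = Pipeline([
    ("select", SelectFromModel(ExtraTreesClassifier(n_estimators=100,
                                                    random_state=42))),
    ("clf", BaggingClassifier(n_estimators=100, random_state=42)),
])
et_bagging.fit(X_tr, y_tr)
pred = et_bagging.predict(X_te)
print(f"accuracy: {accuracy_score(y_te, pred):.4f}")
print(f"F1 score: {f1_score(y_te, pred):.4f}")

# Paired-sample t-test on fold-wise accuracies: ET selection vs. all features
baseline = BaggingClassifier(n_estimators=100, random_state=42)
acc_sel = cross_val_score(et_bagging, X, y, cv=10, scoring="accuracy")
acc_all = cross_val_score(baseline, X, y, cv=10, scoring="accuracy")
t_stat, p_value = ttest_rel(acc_sel, acc_all)
print(f"paired t-test: t={t_stat:.3f}, p={p_value:.3f}")
```

The paired t-test mirrors the paper's analysis strategy in spirit: the same folds score both pipelines, so their accuracy differences can be tested as paired samples rather than independent ones.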
Journal overview:
Alexandria Engineering Journal is an international journal devoted to publishing high-quality papers in the field of engineering and applied science. Alexandria Engineering Journal is cited in the Engineering Information Services (EIS) and the Chemical Abstracts (CA). The papers published in Alexandria Engineering Journal are grouped into five sections, according to the following classification:
• Mechanical, Production, Marine and Textile Engineering
• Electrical Engineering, Computer Science and Nuclear Engineering
• Civil and Architecture Engineering
• Chemical Engineering and Applied Sciences
• Environmental Engineering