Title: Optimization of multidimensional feature engineering and data partitioning strategies in heart disease prediction models
Journal: Alexandria Engineering Journal (Q1, Engineering, Multidisciplinary; Impact Factor 6.2)
DOI: 10.1016/j.aej.2024.09.037
Published: 2024-09-18 (Journal Article)
URL: https://www.sciencedirect.com/science/article/pii/S1110016824010603
Citations: 0
Abstract
The relentless rise in heart disease incidence, a leading global cause of death, presents a significant public health challenge, making precise prediction of heart disease risk and early intervention crucial. This study investigates performance improvements in heart disease prediction models built with machine learning and deep learning algorithms. We used the Heart Failure Prediction Dataset from Kaggle and, after preprocessing to ensure data quality, applied three distinct feature engineering techniques: principal component analysis (PCA) for dimensionality reduction, Extra Trees (ET) for feature selection, and Pearson's correlation coefficient for feature selection, assessing the impact of each on model performance. The dataset was then partitioned at three test-to-training split ratios (1:9, 2:8, and 3:7) to determine their specific effects on model performance. Twelve machine learning classifiers (LGBM, AdaBoost, XGB, RF, DT, KNN, LR, GNB, ET, SVC, GB, and Bagging) were trained and evaluated on five key metrics: accuracy, recall, precision, F1 score, and training time. The influence of the different feature engineering methods and data partitioning ratios on model performance was systematically analyzed using paired-sample t-tests. Among the configurations compared, the Bagging classifier combined with ET-based feature selection performed best, achieving 97.48% accuracy and a 97.48% F1 score at a 1:9 test-to-training split, 94.96% accuracy and a 94.95% F1 score at 2:8, and 94.12% accuracy and a 94.11% F1 score at 3:7. Paired-sample t-test results indicate that feature selection using the Pearson correlation coefficient shortens training time but also degrades classifier performance.
After PCA dimensionality reduction, there was no significant difference in training efficiency or classifier efficacy compared with the control group. Feature selection through ET, however, significantly reduced training time across the various classifiers while preserving their performance.
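The abstract's best-performing configuration (ET-based feature selection feeding a Bagging classifier at a 1:9 test-to-training split, compared against a baseline via a paired-sample t-test) can be sketched as follows. This is a minimal illustration using scikit-learn and scipy, not the authors' implementation: a synthetic dataset stands in for the Kaggle Heart Failure Prediction data (11 predictors, binary target), and all hyperparameters are illustrative assumptions.

```python
# Sketch: Extra-Trees feature selection + Bagging classifier, evaluated at a
# 1:9 test:train split, with a paired-sample t-test against a no-selection
# baseline. Synthetic data stands in for the Kaggle dataset.
from scipy.stats import ttest_rel
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline

# Stand-in for the preprocessed heart disease data (~900 rows, 11 features)
X, y = make_classification(n_samples=900, n_features=11, n_informative=6,
                           random_state=42)

# 1:9 split between test and training sets (10% held out for testing)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1,
                                          stratify=y, random_state=42)

# ET feature selection (keep features above mean importance), then Bagging
et_bagging = Pipeline([
    ("select", SelectFromModel(ExtraTreesClassifier(n_estimators=100,
                                                    random_state=42))),
    ("clf", BaggingClassifier(n_estimators=100, random_state=42)),
])
et_bagging.fit(X_tr, y_tr)
pred = et_bagging.predict(X_te)
print(f"accuracy: {accuracy_score(y_te, pred):.4f}")
print(f"F1 score: {f1_score(y_te, pred):.4f}")

# Paired-sample t-test on fold-wise accuracies: ET selection vs. all features
baseline = BaggingClassifier(n_estimators=100, random_state=42)
acc_sel = cross_val_score(et_bagging, X, y, cv=10, scoring="accuracy")
acc_all = cross_val_score(baseline, X, y, cv=10, scoring="accuracy")
t_stat, p_value = ttest_rel(acc_sel, acc_all)
print(f"paired t-test: t={t_stat:.3f}, p={p_value:.3f}")
```

The paired t-test mirrors the paper's analysis strategy in spirit: the same folds score both pipelines, so their accuracy differences can be tested as paired samples rather than independent ones.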
Journal overview:
Alexandria Engineering Journal is an international journal devoted to publishing high-quality papers in the field of engineering and applied science. Alexandria Engineering Journal is cited in the Engineering Information Services (EIS) and the Chemical Abstracts (CA). The papers published in Alexandria Engineering Journal are grouped into five sections, according to the following classification:
• Mechanical, Production, Marine and Textile Engineering
• Electrical Engineering, Computer Science and Nuclear Engineering
• Civil and Architecture Engineering
• Chemical Engineering and Applied Sciences
• Environmental Engineering