Machine Learning-Based Risk Factor Analysis and Prediction Model Construction for the Occurrence of Chronic Heart Failure: Health Ecologic Study.

JMIR Medical Informatics; IF 3.8; Zone 3 (Medicine); Q2 (Medical Informatics); Pub Date: 2025-01-31; DOI: 10.2196/64972

Qian Xu, Xue Cai, Ruicong Yu, Yueyue Zheng, Guanjie Chen, Hui Sun, Tianyun Gao, Cuirong Xu, Jing Sun

JMIR Medical Informatics, volume 13, e64972. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11829185/pdf/

Citations: 0

Abstract

Background: Chronic heart failure (CHF) is a serious threat to human health, with high morbidity and mortality rates, and it imposes a heavy burden on the health care system and society. The abundance of medical data and the rapid development of machine learning (ML) technologies have created new opportunities for in-depth investigation of the mechanisms of CHF and the construction of predictive models. Health ecology research methodology enables a comprehensive analysis of CHF risk factors across a wide range of environmental, social, and individual factors. This not only helps to identify high-risk groups at an early stage but also provides a scientific basis for developing precise prevention and intervention strategies.

Objective: This study aims to use ML to construct a predictive model of the risk of occurrence of CHF and analyze the risk of CHF from a health ecology perspective.

Methods: This study sourced data from the Jackson Heart Study database. Stringent data preprocessing procedures were implemented, which included meticulous management of missing values and the standardization of data. Principal component analysis and random forest (RF) were used as feature selection techniques. Subsequently, several ML models, namely decision tree, RF, extreme gradient boosting, adaptive boosting (AdaBoost), support vector machine, naive Bayes model, multilayer perceptron, and bootstrap forest, were constructed, and their performance was evaluated. The effectiveness of the models was validated through internal validation using a 10-fold cross-validation approach on the training and validation sets. In addition, the performance metrics of each model, including accuracy, precision, sensitivity, F₁-score, and area under the curve (AUC), were compared. After selecting the best model, we used hyperparameter optimization to construct a better model.
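The workflow described in the Methods can be illustrated with a minimal scikit-learn sketch. This is not the authors' code: the data are synthetic (generated with `make_classification`, not taken from the Jackson Heart Study), median imputation is an assumed stand-in for the unspecified missing-value handling, and the tree counts and random seeds are arbitrary. Only the choice of 21 RF-selected features, AdaBoost as the classifier, 10-fold cross-validation, and the five reported metrics come from the abstract.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cohort data (hypothetical, not study data).
X, y = make_classification(
    n_samples=400, n_features=40, n_informative=10, random_state=0
)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan  # inject roughly 5% missing values

pipeline = Pipeline([
    # Assumed preprocessing: median imputation, then standardization.
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    # Keep the 21 features with the highest random forest importance.
    # threshold=-inf makes max_features the only selection criterion.
    ("select", SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=0),
        max_features=21,
        threshold=-np.inf,
    )),
    ("model", AdaBoostClassifier(n_estimators=50, random_state=0)),
])

# 10-fold stratified cross-validation; every step is refit inside each fold,
# so imputation, scaling, and feature selection never see the held-out fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scoring = {
    "accuracy": "accuracy",
    "precision": "precision",
    "sensitivity": "recall",  # sensitivity is recall on the positive class
    "f1": "f1",
    "auc": "roc_auc",
}
scores = cross_validate(pipeline, X, y, cv=cv, scoring=scoring)
mean_scores = {name: float(scores[f"test_{name}"].mean()) for name in scoring}

for name, value in mean_scores.items():
    print(f"{name}: {value:.3f}")
```

Wrapping preprocessing and feature selection in a single `Pipeline` is a design choice of this sketch: it keeps the cross-validated estimates free of leakage from the held-out fold. The abstract does not state whether the study fitted these steps per fold.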

Results: RF-selected features (21 in total) had an average root mean square error of 0.30, outperforming principal component analysis. Synthetic Minority Oversampling Technique and Edited Nearest Neighbors showed better accuracy in data balancing. The AdaBoost model was most effective with an AUC of 0.86, accuracy of 75.30%, precision of 0.86, sensitivity of 0.69, and F₁-score of 0.76. Validation on the training and validation sets through 10-fold cross-validation gave an AUC of 0.97, an accuracy of 91.27%, a precision of 0.94, a sensitivity of 0.92, and an F₁-score of 0.94. After random search processing, the accuracy and AUC of AdaBoost improved. Its accuracy was 77.68% and its AUC was 0.86.
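The random search step mentioned in the Results could look roughly like the following sketch, which tunes an AdaBoost classifier with scikit-learn's `RandomizedSearchCV`. The search space (`n_estimators`, `learning_rate`), the number of sampled configurations, the 80/20 split, and the synthetic 21-feature data set are all assumptions made for illustration; the abstract does not report which hyperparameters or ranges were searched.

```python
from scipy.stats import loguniform, randint
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic data with 21 features (hypothetical, not study data).
X, y = make_classification(
    n_samples=500, n_features=21, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Assumed search space: each iteration draws one value per distribution.
param_distributions = {
    "n_estimators": randint(25, 200),        # integers in [25, 200)
    "learning_rate": loguniform(0.01, 1.0),  # log-uniform over [0.01, 1.0]
}

search = RandomizedSearchCV(
    AdaBoostClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=6,           # number of random configurations to try
    scoring="roc_auc",  # select the configuration with the best mean AUC
    cv=3,
    random_state=0,
)
search.fit(X_train, y_train)

# Evaluate the refitted best configuration on the held-out test split.
best_model = search.best_estimator_
test_accuracy = accuracy_score(y_test, best_model.predict(X_test))
test_auc = roc_auc_score(y_test, best_model.predict_proba(X_test)[:, 1])

print("best parameters:", search.best_params_)
print(f"test accuracy: {test_accuracy:.3f}, test AUC: {test_auc:.3f}")
```

Random search samples a fixed number of configurations instead of enumerating a full grid, which keeps the cost predictable when the ranges are wide. Scores on the held-out split are the fair comparison against an untuned baseline.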

Conclusions: This study offered insights into CHF risk prediction. Future research should focus on prospective studies, diverse data, advanced techniques, longitudinal studies, and exploring factor interactions for better CHF prevention and management.

Source journal: JMIR Medical Informatics (Medicine-Health Informatics)

CiteScore: 7.90

Self-citation rate: 3.10%

Articles published: 173

Review time: 12 weeks

About the journal: JMIR Medical Informatics (JMI, ISSN 2291-9694) is a top-rated, tier A journal that focuses on clinical informatics, big data in health and health care, decision support for health professionals, electronic health records, and eHealth infrastructures and implementation. It focuses on applied, translational research and has a broad readership that includes clinicians, CIOs, engineers, and industry and health informatics professionals. It is published by JMIR Publications, publisher of the Journal of Medical Internet Research (JMIR), the leading eHealth/mHealth journal (Impact Factor 2016: 5.175). JMIR Med Inform has a slightly different scope, placing more emphasis on applications for clinicians and health professionals than on consumers and citizens, who are the focus of JMIR. It also publishes even faster and accepts papers that are more technical or more formative than those published in the Journal of Medical Internet Research.