使用Naïve贝叶斯算法和机器学习技术的心脏病预测模型

Mariam W. Yousef, K. Batiha
{"title":"使用Naïve贝叶斯算法和机器学习技术的心脏病预测模型","authors":"Mariam W. Yousef, K. Batiha","doi":"10.14419/IJET.V10I1.31310","DOIUrl":null,"url":null,"abstract":"These days, heart disease comes to be one of the major health problems which have affected the lives of people in the whole world. Moreover, death due to heart disease is increasing day by day. So the heart disease prediction systems play an important role in the prevention of heart problems. Where these prediction systems assist doctors in making the right decision to diagnose heart disease easily. The existing prediction systems suffering from the high dimensionality problem of selected features that increase the prediction time and decrease the performance accuracy of the prediction due to many redundant or irrelevant features. Therefore, this paper aims to provide a solution of the dimensionality problem by proposing a new mixed model for heart disease prediction based on (Naive Bayes method, and machine learning classifiers). In this study, we proposed a new heart disease prediction model (NB-SKDR) based on the Naive Bayes algorithm (NB) and several machine learning techniques including Support Vector Machine, K-Nearest Neighbors, Decision Tree, and Random Forest. This prediction model consists of three main phases which include: preprocessing, feature selection, and classification. The main objective of this proposed model is to improve the performance of the prediction system and finding the best subset of features. This proposed approach uses the Naive Bayes technique based on the Bayes theorem to select the best subset of features for the next classification phase, also to handle the high dimensionality problem by avoiding unnecessary features and select only the important ones in an attempt to improve the efficiency and accuracy of classifiers. This method is able to reduce the number of features from 13 to 6 which are (age, gender, blood pressure, fasting blood sugar, cholesterol, exercise induce engine) by determining the dependency between a set of attributes. The dependent attributes are the attributes in which an attribute depends on the other attribute in deciding the value of the class attribute. The dependency between attributes is measured by the conditional probability, which can be easily computed by Bayes theorem. Moreover, in the classification phase, the proposed system uses different classification algorithms such as (DT Decision Tree, RF Random Forest, SVM Support Vector machine, KNN Nearest Neighbors) as a classifiers for predicting whether a patient has heart disease or not. The model is trained and evaluated using the Cleveland Heart Disease database, which contains 13 features and 303 samples. Different algorithms use different rules for producing different representations of knowledge. So, the selection of algorithms to build our model is based on their performance. In this work, we applied and compared several classification algorithms which are (DT, SVM, RF, and KNN) to identify the best-suited algorithm to achieve high accuracy in the prediction of heart disease. After combining the Naive Bayes method with each one of these previous classifiers the performance of these combines algorithms is evaluated by different performance metrics such as (Specificity, Sensitivity, and Accuracy). Where the experimental results show that out of these four classification models, the combination between the Naive Bayes feature selection approach and the SVM RBF classifier can predict heart disease with the highest accuracy of 98%. Finally, the proposed approach is compared with another two systems which developed based on two different approaches in the feature selection step. The first system, based on the Genetic Algorithm (GA) technique, and the second uses the Principal Component Analysis (PCA) technique. Consequently, the comparison proved that the Naive Bayes selection approach of the proposed system is better than the GA and PCA approach in terms of prediction accuracy.","PeriodicalId":14142,"journal":{"name":"International journal of engineering and technology","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2021-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"Heart Disease Prediction Model Using Naïve Bayes Algorithm and Machine Learning Techniques\",\"authors\":\"Mariam W. Yousef, K. Batiha\",\"doi\":\"10.14419/IJET.V10I1.31310\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"These days, heart disease comes to be one of the major health problems which have affected the lives of people in the whole world. Moreover, death due to heart disease is increasing day by day. So the heart disease prediction systems play an important role in the prevention of heart problems. Where these prediction systems assist doctors in making the right decision to diagnose heart disease easily. The existing prediction systems suffering from the high dimensionality problem of selected features that increase the prediction time and decrease the performance accuracy of the prediction due to many redundant or irrelevant features. Therefore, this paper aims to provide a solution of the dimensionality problem by proposing a new mixed model for heart disease prediction based on (Naive Bayes method, and machine learning classifiers). In this study, we proposed a new heart disease prediction model (NB-SKDR) based on the Naive Bayes algorithm (NB) and several machine learning techniques including Support Vector Machine, K-Nearest Neighbors, Decision Tree, and Random Forest. This prediction model consists of three main phases which include: preprocessing, feature selection, and classification. The main objective of this proposed model is to improve the performance of the prediction system and finding the best subset of features. This proposed approach uses the Naive Bayes technique based on the Bayes theorem to select the best subset of features for the next classification phase, also to handle the high dimensionality problem by avoiding unnecessary features and select only the important ones in an attempt to improve the efficiency and accuracy of classifiers. This method is able to reduce the number of features from 13 to 6 which are (age, gender, blood pressure, fasting blood sugar, cholesterol, exercise induce engine) by determining the dependency between a set of attributes. The dependent attributes are the attributes in which an attribute depends on the other attribute in deciding the value of the class attribute. The dependency between attributes is measured by the conditional probability, which can be easily computed by Bayes theorem. Moreover, in the classification phase, the proposed system uses different classification algorithms such as (DT Decision Tree, RF Random Forest, SVM Support Vector machine, KNN Nearest Neighbors) as a classifiers for predicting whether a patient has heart disease or not. The model is trained and evaluated using the Cleveland Heart Disease database, which contains 13 features and 303 samples. Different algorithms use different rules for producing different representations of knowledge. So, the selection of algorithms to build our model is based on their performance. In this work, we applied and compared several classification algorithms which are (DT, SVM, RF, and KNN) to identify the best-suited algorithm to achieve high accuracy in the prediction of heart disease. After combining the Naive Bayes method with each one of these previous classifiers the performance of these combines algorithms is evaluated by different performance metrics such as (Specificity, Sensitivity, and Accuracy). Where the experimental results show that out of these four classification models, the combination between the Naive Bayes feature selection approach and the SVM RBF classifier can predict heart disease with the highest accuracy of 98%. Finally, the proposed approach is compared with another two systems which developed based on two different approaches in the feature selection step. The first system, based on the Genetic Algorithm (GA) technique, and the second uses the Principal Component Analysis (PCA) technique. Consequently, the comparison proved that the Naive Bayes selection approach of the proposed system is better than the GA and PCA approach in terms of prediction accuracy.\",\"PeriodicalId\":14142,\"journal\":{\"name\":\"International journal of engineering and technology\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-02-02\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"International journal of engineering and technology\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.14419/IJET.V10I1.31310\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"International journal of engineering and technology","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.14419/IJET.V10I1.31310","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

摘要

如今,心脏病已成为影响全世界人民生活的主要健康问题之一。此外,因心脏病而死亡的人数日益增加。因此,心脏疾病预测系统在预防心脏疾病中起着重要的作用。这些预测系统帮助医生做出正确的决定,轻松诊断心脏病。现有的预测系统存在特征选择的高维问题,由于存在大量冗余或不相关的特征,增加了预测时间,降低了预测的性能精度。因此,本文旨在通过提出一种新的基于朴素贝叶斯方法和机器学习分类器的心脏病预测混合模型来解决维数问题。在本研究中,我们提出了一种基于朴素贝叶斯算法(NB)和支持向量机、k近邻、决策树和随机森林等多种机器学习技术的心脏病预测模型(NB- skdr)。该预测模型包括预处理、特征选择和分类三个主要阶段。该模型的主要目标是提高预测系统的性能并找到最佳的特征子集。该方法利用基于贝叶斯定理的朴素贝叶斯技术来选择下一阶段分类的最佳特征子集,并通过避免不必要的特征,只选择重要的特征来处理高维问题,以提高分类器的效率和准确性。该方法能够通过确定一组属性之间的依赖关系,将特征(年龄、性别、血压、空腹血糖、胆固醇、运动诱导引擎)的数量从13个减少到6个。依赖属性是指一个属性在决定类属性的值时依赖于另一个属性的属性。属性之间的依赖关系由条件概率来衡量,条件概率可以很容易地由贝叶斯定理计算出来。此外,在分类阶段,提出的系统使用不同的分类算法,如(DT决策树,RF随机森林,SVM支持向量机,KNN近邻)作为分类器来预测患者是否患有心脏病。该模型使用克利夫兰心脏病数据库进行训练和评估,该数据库包含13个特征和303个样本。不同的算法使用不同的规则来产生不同的知识表示。因此,选择算法来构建我们的模型是基于它们的性能。在这项工作中,我们应用并比较了几种分类算法(DT, SVM, RF和KNN),以确定最适合的算法,以实现对心脏病的高精度预测。在将朴素贝叶斯方法与这些先前的分类器相结合之后,这些组合算法的性能通过不同的性能指标进行评估,例如(特异性,灵敏度和准确性)。其中实验结果表明,在这四种分类模型中,朴素贝叶斯特征选择方法与SVM RBF分类器相结合的方法预测心脏病的准确率最高,达到98%。最后,在特征选择步骤中将该方法与基于两种不同方法开发的另外两种系统进行了比较。第一个系统基于遗传算法(GA)技术,第二个系统使用主成分分析(PCA)技术。结果表明,该系统的朴素贝叶斯选择方法在预测精度上优于遗传算法和主成分分析方法。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
Heart Disease Prediction Model Using Naïve Bayes Algorithm and Machine Learning Techniques
These days, heart disease comes to be one of the major health problems which have affected the lives of people in the whole world. Moreover, death due to heart disease is increasing day by day. So the heart disease prediction systems play an important role in the prevention of heart problems. Where these prediction systems assist doctors in making the right decision to diagnose heart disease easily. The existing prediction systems suffering from the high dimensionality problem of selected features that increase the prediction time and decrease the performance accuracy of the prediction due to many redundant or irrelevant features. Therefore, this paper aims to provide a solution of the dimensionality problem by proposing a new mixed model for heart disease prediction based on (Naive Bayes method, and machine learning classifiers). In this study, we proposed a new heart disease prediction model (NB-SKDR) based on the Naive Bayes algorithm (NB) and several machine learning techniques including Support Vector Machine, K-Nearest Neighbors, Decision Tree, and Random Forest. This prediction model consists of three main phases which include: preprocessing, feature selection, and classification. The main objective of this proposed model is to improve the performance of the prediction system and finding the best subset of features. This proposed approach uses the Naive Bayes technique based on the Bayes theorem to select the best subset of features for the next classification phase, also to handle the high dimensionality problem by avoiding unnecessary features and select only the important ones in an attempt to improve the efficiency and accuracy of classifiers. This method is able to reduce the number of features from 13 to 6 which are (age, gender, blood pressure, fasting blood sugar, cholesterol, exercise induce engine) by determining the dependency between a set of attributes. The dependent attributes are the attributes in which an attribute depends on the other attribute in deciding the value of the class attribute. The dependency between attributes is measured by the conditional probability, which can be easily computed by Bayes theorem. Moreover, in the classification phase, the proposed system uses different classification algorithms such as (DT Decision Tree, RF Random Forest, SVM Support Vector machine, KNN Nearest Neighbors) as a classifiers for predicting whether a patient has heart disease or not. The model is trained and evaluated using the Cleveland Heart Disease database, which contains 13 features and 303 samples. Different algorithms use different rules for producing different representations of knowledge. So, the selection of algorithms to build our model is based on their performance. In this work, we applied and compared several classification algorithms which are (DT, SVM, RF, and KNN) to identify the best-suited algorithm to achieve high accuracy in the prediction of heart disease. After combining the Naive Bayes method with each one of these previous classifiers the performance of these combines algorithms is evaluated by different performance metrics such as (Specificity, Sensitivity, and Accuracy). Where the experimental results show that out of these four classification models, the combination between the Naive Bayes feature selection approach and the SVM RBF classifier can predict heart disease with the highest accuracy of 98%. Finally, the proposed approach is compared with another two systems which developed based on two different approaches in the feature selection step. The first system, based on the Genetic Algorithm (GA) technique, and the second uses the Principal Component Analysis (PCA) technique. Consequently, the comparison proved that the Naive Bayes selection approach of the proposed system is better than the GA and PCA approach in terms of prediction accuracy.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Comparison of the Fluorescence Properties of Biological Solutions and Aerosols A Hybrid Machine Learning and Fuzzy Inference Approach with UAV for Indoor Virus Contamination Risk Influence of Yarn Hairiness on the Mechanical Properties of Unidirectional Jute Polyester Composites The Building Material Use Study of the Eco Learning Camps Design for Elementary and Middle School Students: A Case Study Feasibility Study of the Location Selection for Oil Distribution Center with Sensitivity Analysis Case Study: A Sample Oil Company
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1