Performance Evaluation of Machine Learning Models for Multiple Chronic Disease Diagnosis Using Symptom Data

Journal: Automatic Control and Computer Sciences | IF: 0.6 | JCR: Q4 (Automation & Control Systems) | Pub Date: 2024-05-06 | DOI: 10.3103/S0146411624700093

Kulvinder Singh, Sanjeev Dhawan, Deepanshu Mehla

Vol. 58, No. 2, pp. 195-208. Journal Article. Publisher page: https://link.springer.com/article/10.3103/S0146411624700093

Citations: 0


Performance Evaluation of Machine Learning Models for Multiple Chronic Disease Diagnosis Using Symptom Data

Timely and accurate analysis of a health problem is essential for preventing and treating any illness. Machine learning (ML) is increasingly used in medical science to diagnose a wide range of diseases from the symptoms patients experience. The main objective of this research is a comparative analysis of ML models that predict diseases from symptoms. The dataset, obtained from Kaggle, covers 41 diseases and records their symptoms in 17 columns, together with symptom weights. In other words, there are 17 symptom columns serving as independent variables (most symptoms differ from patient to patient) and one target variable (disease). The data are first preprocessed to make them suitable for the various ML approaches. Three techniques are then applied for normalization: standard scaling, min-max scaling, and principal component analysis (PCA). The study evaluates a range of ML models: LGB classifier, KNN, random forest (RF), CatBoost, support vector machine (SVM), XGBoost, and a hybrid model that combines two existing approaches (SVM and XGBoost). Under each scaling technique, the models were assessed using root mean squared error (RMSE), mean squared error (MSE), cross-validation score, R2 score, and accuracy. Random forest, LGB classifier, and XGBoost outperformed the other models in terms of accuracy, R2 score, and RMSE, achieving scores of 98, 96, and 2.08%, respectively. RF also required less computation time than the other models, particularly under standard scaling, where it took only 0.129 s.
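The comparison described in the abstract can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the authors' code: it uses scikit-learn only, a synthetic 17-feature dataset in place of the Kaggle symptom table, 5 classes instead of 41, and hypothetical hyperparameters (`n_estimators=100`, `n_neighbors=5`, `n_components=10`). PCA is applied after standardization, which is also an assumption. The gradient-boosting models (LGB, CatBoost, XGBoost) and the SVM + XGBoost hybrid are left out because the abstract does not say how they were configured or combined.

```python
import time

import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the symptom table (assumption): 17 numeric features.
X, y = make_classification(
    n_samples=600, n_features=17, n_informative=10,
    n_classes=5, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# The three "scaling" options named in the abstract.
transforms = {
    "standard": StandardScaler(),
    "min-max": MinMaxScaler(),
    "pca": make_pipeline(StandardScaler(), PCA(n_components=10)),
}

# A subset of the models named in the abstract, with hypothetical settings.
models = {
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(kernel="rbf"),
}


def evaluate(transform, model):
    """Fit one transform + model pipeline and return the abstract's metrics."""
    pipe = make_pipeline(clone(transform), clone(model))
    start = time.perf_counter()
    pipe.fit(X_train, y_train)
    fit_seconds = time.perf_counter() - start
    y_pred = pipe.predict(X_test)
    # MSE, RMSE and R2 are computed on integer class labels, so their values
    # depend on how the classes happen to be encoded.
    mse = mean_squared_error(y_test, y_pred)
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "mse": mse,
        "rmse": float(np.sqrt(mse)),
        "r2": r2_score(y_test, y_pred),
        "cv_score": cross_val_score(pipe, X_train, y_train, cv=5).mean(),
        "fit_seconds": fit_seconds,
    }


results = {
    (t_name, m_name): evaluate(transform, model)
    for t_name, transform in transforms.items()
    for m_name, model in models.items()
}

for (t_name, m_name), m in results.items():
    print(f"{t_name:8s} {m_name:4s} acc={m['accuracy']:.3f} "
          f"rmse={m['rmse']:.3f} r2={m['r2']:.3f} "
          f"cv={m['cv_score']:.3f} fit={m['fit_seconds']:.3f}s")
```

Each row of the printed table corresponds to one scaling technique paired with one model, which mirrors the grid the abstract describes. The numbers produced on synthetic data say nothing about the paper's reported results.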


Source journal

Automatic Control and Computer Sciences (Automation & Control Systems)

CiteScore: 1.70

Self-citation rate: 22.20%

Journal description: Automatic Control and Computer Sciences is a peer-reviewed journal that publishes articles on:

• Control systems, cyber-physical systems, real-time systems, robotics, smart sensors, embedded intelligence
• Network information technologies, information security, statistical methods of data processing, distributed artificial intelligence, complex systems modeling, knowledge representation, processing and management
• Signal and image processing, machine learning, machine perception, computer vision