Performance Evaluation of Machine Learning Models for Multiple Chronic Disease Diagnosis Using Symptom Data
Authors: Kulvinder Singh, Sanjeev Dhawan, Deepanshu Mehla
Journal: Automatic Control and Computer Sciences, vol. 58, no. 2, pp. 195-208
Published: 2024-05-06
DOI: 10.3103/S0146411624700093
URL: https://link.springer.com/article/10.3103/S0146411624700093
Impact factor: 0.6; JCR: Q4 (Automation & Control Systems)
Timely and accurate diagnosis is essential to preventing and treating any illness. Machine learning (ML) is increasingly used in medical science to diagnose a wide range of diseases from the symptoms patients experience. The main objective of this research is a comparative analysis of different ML models that predict diseases from symptoms. The dataset, obtained from Kaggle, covers 41 diseases, with their symptoms stored in 17 columns along with symptom weights. In other words, there are 17 symptom columns as independent variables (symptoms differ from patient to patient, with some exceptions) and 1 target variable (the disease). Preprocessing is applied to make the data suitable for the various ML approaches. After that, three scaling techniques are applied for normalization: standard scaling, min-max scaling, and principal component analysis (PCA). The study evaluates a variety of ML models: LGB classifier, KNN, random forest (RF), CatBoost, support vector machine (SVM), XGBoost, and a hybrid model that combines two existing approaches (SVM and XGBoost). Each scaling technique was assessed using several evaluation metrics: root mean squared error (RMSE), cross-validation score, R2 score, mean squared error (MSE), and accuracy. Random forest, LGB classifier, and XGBoost outperformed the other models with respect to accuracy, R2 score, and RMSE, achieving scores of 98, 96, and 2.08%, respectively. The RF algorithm also required less computation time, particularly with standard scaling, where it took only 0.129 s.
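The scaling steps and evaluation metrics named in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation. The sample values, the function names, and the choice to compute MSE, RMSE, and R2 on integer-encoded disease labels are all assumptions made for illustration, since the abstract does not say how these regression-style metrics were applied to a classification task.

```python
import math
from statistics import mean, pstdev


def standard_scale(column):
    """Standard scaling: shift to zero mean, divide by the population std."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        return [0.0 for _ in column]  # constant column carries no information
    return [(x - mu) / sigma for x in column]


def min_max_scale(column):
    """Min-max scaling: map the column linearly onto [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error between numeric labels and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical data: one column of symptom weights, plus integer-encoded
# disease labels (an assumption; the abstract does not describe the encoding).
weights = [1, 3, 5, 7]
y_true = [0, 1, 2, 3]
y_pred = [0, 1, 2, 2]

scaled_std = standard_scale(weights)
scaled_mm = min_max_scale(weights)

print("standard:", [round(v, 3) for v in scaled_std])
print("min-max: ", [round(v, 3) for v in scaled_mm])
print("accuracy:", accuracy(y_true, y_pred))
print("MSE:     ", mse(y_true, y_pred))
print("RMSE:    ", rmse(y_true, y_pred))
print("R2:      ", round(r2_score(y_true, y_pred), 3))
```

Note that MSE, RMSE, and R2 depend on the arbitrary integer assigned to each disease, so under this assumed encoding they are less directly interpretable than accuracy.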
About the journal:
Automatic Control and Computer Sciences is a peer-reviewed journal that publishes articles on:
• Control systems, cyber-physical systems, real-time systems, robotics, smart sensors, embedded intelligence
• Network information technologies, information security, statistical methods of data processing, distributed artificial intelligence, complex systems modeling, knowledge representation, processing and management
• Signal and image processing, machine learning, machine perception, computer vision