Sushovan Chaudhury, Nilesh Shelke, Zahraa M. Rashid, K. Sau
Current Signal Transduction Therapy, published 2022-07-15. DOI: 10.2174/1574362417666220715105527
Effect of Grid Search and Hyper Parameter Tuned Pipeline with Various Classifiers and PCA for Breast Cancer Detection
For most researchers, the study of breast cancer detection begins with the well-known WBCD dataset. In this paper we use it as a benchmark to study ML algorithms such as SVM, DT, RF, KNN, NB classifiers, Logistic Regression, Extra Trees, Bagging Classifiers with hard and soft voting, ensemble techniques, Extreme Gradient Boosting classifiers such as XGBoost, and two deep learning models, one with regularization and one without.
The primary objective is to revisit how existing classifiers fare on the WBCD dataset and to propose a method that uses Grid search and Randomized search to select the best hyper-parameters, applied both with and without PCA. We then check whether the WBCD dataset can be classified in less time without compromising accuracy.
We explore PCA as a feature extraction technique on this dataset and use techniques such as feature scaling, stratified K-fold cross-validation, and K-best selection. We implement Grid search CV together with PCA in a pipeline to tune the hyper-parameters of various classifiers and to reduce training and prediction time without compromising accuracy. Finally, this paper also compares the accuracy, precision and recall of various ML techniques on features selected manually by inspecting the feature importance scores and the correlation matrix.
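A pipeline of this kind can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes scikit-learn, uses the library's bundled `load_breast_cancer` data (the Wisconsin Diagnostic set) as a stand-in for WBCD, and the grid values, split sizes and random seeds are hypothetical choices made only for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data (assumption): scikit-learn's bundled Wisconsin Diagnostic set.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Scaling and PCA live inside the pipeline, so they are re-fitted on the
# training part of every cross-validation fold and never see held-out data.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("clf", SVC()),
])

# Illustrative grid (assumption): not the values used in the paper.
param_grid = {
    "pca__n_components": [5, 10, 15],
    "clf__kernel": ["rbf", "linear"],
    "clf__C": [0.1, 1, 10],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="accuracy")
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print("best parameters:", search.best_params_)
print("held-out accuracy:", round(test_accuracy, 3))
```

Tuning the number of principal components in the same grid as the classifier settings lets the search trade dimensionality against accuracy in one pass, which is the idea the abstract describes.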
In our experiment with all features, we obtain an accuracy of 97.9 per cent for Extra Trees and for ensemble techniques combining RF, KNN and Extra Trees with a soft voting strategy. Using feature selection with PCA and grid search, we obtain an accuracy of 99.1 per cent with SVM (kernel trick). We also demonstrate that training and prediction times decrease when the hyper-parameters of the classifiers are tuned appropriately, which is handled by the Grid and Randomized hyper-parameter grids.
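The soft-voting ensemble and the timing measurement mentioned above might look like the sketch below. It is an illustration under assumptions, not a reproduction of the reported numbers: the data is scikit-learn's bundled `load_breast_cancer` set, the estimator settings are library defaults or arbitrary picks, and wall-clock times depend on the machine.

```python
import time

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (
    ExtraTreesClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Soft voting averages the predicted class probabilities of the members.
# KNN is distance-based, so it gets its own scaler; the tree models do not need one.
ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))),
        ("et", ExtraTreesClassifier(n_estimators=100, random_state=0)),
    ],
    voting="soft",
)

start = time.perf_counter()
ensemble.fit(X_train, y_train)
fit_seconds = time.perf_counter() - start

start = time.perf_counter()
y_pred = ensemble.predict(X_test)
predict_seconds = time.perf_counter() - start

# Precision and recall are computed for label 1, which is "benign"
# in scikit-learn's copy of the data.
metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
}
print(metrics)
print(f"fit: {fit_seconds:.3f}s, predict: {predict_seconds:.3f}s")
```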
This paper shows that feature subset selection or feature ranking may be neither the best nor the only approach to apply to the WBCD dataset alongside PCA. In datasets where features are closely correlated, hyper-parameter tuning with either Grid or Randomized Search can be combined with PCA to extract the best feature combinations, which are then fed into the classifiers to obtain good accuracy scores in much less time.
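The Randomized Search variant can be sketched in the same way. Again this is a hedged example rather than the paper's setup: the sampling distributions, the number of iterations (`n_iter=10`) and the 0.9 correlation threshold are assumptions chosen for illustration, and the data is scikit-learn's bundled `load_breast_cancer` set.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Count strongly correlated feature pairs; many such pairs are the
# situation in which PCA is expected to compress the data well.
corr = np.corrcoef(X, rowvar=False)
upper = np.triu_indices_from(corr, k=1)
n_high = int(np.sum(np.abs(corr[upper]) > 0.9))  # 0.9 is an arbitrary threshold
print("feature pairs with |r| > 0.9:", n_high)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("clf", SVC(kernel="rbf")),
])

# Randomized search samples a fixed number of settings instead of
# enumerating a full grid, which bounds the tuning cost.
param_distributions = {
    "pca__n_components": [3, 5, 10, 15, 20],
    "clf__C": loguniform(1e-1, 1e2),
    "clf__gamma": loguniform(1e-4, 1e-1),
}

search = RandomizedSearchCV(
    pipe,
    param_distributions,
    n_iter=10,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="accuracy",
    random_state=0,
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("cross-validated accuracy:", round(search.best_score_, 3))
```

Grid search is exhaustive over the listed values, while randomized search caps the number of fits at `n_iter` times the number of folds, so it is the cheaper option when the parameter space is large.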
About the journal:
In recent years a breakthrough has occurred in our understanding of the molecular pathomechanisms of human diseases: most diseases are related to disorders of intra- and intercellular communication. The concept of signal transduction therapy has moved to the forefront of modern drug research, and a multidisciplinary approach is being used to identify and treat signaling disorders.
The journal publishes timely in-depth reviews, research articles and drug clinical trial studies in the field of signal transduction therapy. Thematic issues are also published to cover selected areas of signal transduction therapy. Coverage of the field includes genomics, proteomics, medicinal chemistry and the relevant diseases involved in signaling, e.g. cancer, neurodegenerative and inflammatory diseases. Current Signal Transduction Therapy is an essential journal for all involved in drug design and discovery.