Effect of Grid Search and Hyper Parameter Tuned Pipeline with Various Classifiers and PCA for Breast Cancer Detection

Current Signal Transduction Therapy (Q3, Medicine) | Pub Date: 2022-07-15 | DOI: 10.2174/1574362417666220715105527

Sushovan Chaudhury, Nilesh Shelke, Zahraa M. Rashid, K. Sau

Citations: 1

Abstract

For most researchers, the study of breast cancer detection begins with the well-known WBCD dataset. In this paper we use it as a benchmark to study ML algorithms such as SVM, DT, RF, KNN, and NB classifiers, Logistic Regression, Extra Trees, Bagging classifiers with hard and soft voting, ensemble techniques, Extreme Gradient Boosting classifiers such as XGBoost, and two deep learning models, one with regularization and one without. The primary objective is to revisit how existing classifiers fare on the WBCD dataset and to propose a method that uses grid search and randomized search to select the best hyperparameters, applied both with and without PCA, and to check whether the dataset can be classified in less time without compromising accuracy. We explore PCA as a feature extraction technique on this dataset, together with techniques such as feature scaling, stratified K-fold cross-validation, and K-best feature selection. We implement grid search CV along with PCA in a pipeline to tune the hyperparameters across various classifiers and to reduce training and prediction time without compromising accuracy. Finally, the paper compares the accuracy, precision, and recall of various ML techniques on features selected manually by inspecting the feature importance scores and the correlation matrix. In our experiment with all features, we obtain an accuracy of 97.9 per cent for Extra Trees and for an ensemble of RF, KNN, and Extra Trees with a soft voting strategy; using feature selection with PCA and grid search, we obtain an accuracy of 99.1 per cent with SVM (kernel trick). We also demonstrate that training and prediction time decreases when the hyperparameters of the classifiers are tuned appropriately, which the grid and randomized hyperparameter searches take care of. The paper shows that feature subset selection or feature ranking may be neither the best nor the only approach to apply to the WBCD dataset along with PCA. In datasets where features are closely correlated, hyperparameter tuning with either grid or randomized search can be combined with PCA to extract the best feature combinations, which are then fed into the classifiers to obtain good accuracy scores in much less time.
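
The pipeline described in the abstract (feature scaling, PCA, a classifier, and a grid search with stratified K-fold cross-validation) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: it assumes scikit-learn, uses the Wisconsin diagnostic breast cancer data bundled with scikit-learn as a stand-in for the WBCD dataset, and the parameter grid, train/test split, and random seeds are hypothetical choices made only for this example. The accuracy it prints should not be expected to match the figures reported in the paper.

```python
import time

from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: scikit-learn's bundled Wisconsin diagnostic data stands in for WBCD.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Scaling and PCA live inside the pipeline, so they are refit on each
# training fold and never see the corresponding validation fold.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("svm", SVC()),
])

# Hypothetical grid: step-name prefixes route each value to its pipeline step.
param_grid = {
    "pca__n_components": [2, 5, 10],
    "svm__kernel": ["linear", "rbf"],
    "svm__C": [0.1, 1, 10],
    "svm__gamma": ["scale", 0.01],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")

# Time the whole search (all grid points times all folds, plus the final refit).
start = time.perf_counter()
search.fit(X_train, y_train)
fit_seconds = time.perf_counter() - start

test_accuracy = search.score(X_test, y_test)
print("best parameters:", search.best_params_)
print(f"cv accuracy={search.best_score_:.3f}  "
      f"test accuracy={test_accuracy:.3f}  search time={fit_seconds:.2f}s")
```

Replacing `GridSearchCV` with scikit-learn's `RandomizedSearchCV` (which takes the same dictionary as `param_distributions` plus an `n_iter` budget) would give a sketch of the randomized variant the abstract mentions. Keeping PCA inside the pipeline lets the number of components be tuned jointly with the classifier's hyperparameters.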

Source journal

Current Signal Transduction Therapy (Medicine: Pharmacy)

CiteScore

1.70

Self-citation rate

0.00%

Articles published

Review time

>12 weeks

About the journal: In recent years a breakthrough has occurred in our understanding of the molecular pathomechanisms of human diseases: most diseases are now understood to be related to disorders of intra- and intercellular communication. The concept of signal transduction therapy has moved to the forefront of modern drug research, and a multidisciplinary approach is being used to identify and treat signaling disorders. The journal publishes timely in-depth reviews, research articles, and drug clinical trial studies in the field of signal transduction therapy. Thematic issues are also published to cover selected areas of signal transduction therapy. Coverage of the field includes genomics, proteomics, medicinal chemistry, and the relevant diseases involved in signaling, e.g. cancer, neurodegenerative, and inflammatory diseases. Current Signal Transduction Therapy is an essential journal for all involved in drug design and discovery.