Identification of Important Factors in the Diagnosis of Breast Cancer Cells Using Machine Learning Models and Principal Component Analysis

Suejit Pechprasarn, Ohmthong Wattanapermpool, Maninya Warunlawan, Pornchaya Homsud, Thumpussorn Akarajarasroj
Journal of Current Science and Technology. Published 2023-08-30. DOI: 10.59796/jcst.v13n3.2023.700 (https://doi.org/10.59796/jcst.v13n3.2023.700)

Abstract

Breast cancer (BC) is a major and growing cause of morbidity and mortality worldwide. This study uses a publicly available clinical dataset of 699 patients from the University of Wisconsin with nine variables: (1) clump thickness, (2) uniformity of cell size, (3) uniformity of cell shape, (4) marginal adhesion, (5) single epithelial cell size, (6) bare nuclei, (7) bland chromatin, (8) normal nucleoli, and (9) mitoses. The dataset has been used in many earlier studies to pinpoint critical factors in patient diagnosis. Here, we first examine the data to confirm that it is unbiased and accurate. We then apply principal component analysis (PCA) and machine learning models to identify the factors that matter in classifying a tumor as malignant or benign. We compare the classification accuracy of several families of machine learning models: tree, linear discriminant, quadratic discriminant, logistic regression, naive Bayes, support vector machine (SVM), K-nearest neighbor (KNN), ensemble, neural network, and kernel. The most accurate models are medium Gaussian SVM, coarse Gaussian SVM, and cosine KNN, each with an accuracy of 96.5%. PCA is then used to identify the crucial components and to build an accurate model with fewer parameters. The medium Gaussian SVM achieves the highest cross-validation classification accuracy, 96.98%, while requiring only three predictors: normal nucleoli, bare nuclei, and uniformity of cell size.
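
As an illustration of the kind of evaluation described above, the following is a minimal sketch of a cosine-distance KNN classifier scored by k-fold cross-validation. It is not the authors' implementation: the abstract does not state the software used, the value of K, or the number of folds, so k_neighbors=5, folds=5, all function names, and the synthetic three-feature data are assumptions made purely for illustration.

```python
import math
import random
from collections import Counter


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat a zero vector as maximally dissimilar
    return 1.0 - dot / (norm_a * norm_b)


def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training points closest to `query`."""
    order = sorted(
        range(len(train_x)),
        key=lambda i: cosine_distance(train_x[i], query),
    )
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def cross_val_accuracy(xs, ys, k_neighbors=5, folds=5, seed=0):
    """Fraction of samples predicted correctly when each sample is
    classified using only the data outside its own fold."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    correct = 0
    for f in range(folds):
        test_idx = indices[f::folds]  # every `folds`-th shuffled index
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        train_x = [xs[i] for i in train_idx]
        train_y = [ys[i] for i in train_idx]
        for i in test_idx:
            if knn_predict(train_x, train_y, xs[i], k_neighbors) == ys[i]:
                correct += 1
    return correct / len(xs)


def make_toy_data(n_per_class=40, seed=1):
    """Synthetic stand-in data (NOT the Wisconsin dataset).

    The two class centers are hypothetical three-feature vectors chosen
    so that the classes point in different directions, which is what a
    cosine distance can separate.
    """
    rng = random.Random(seed)
    centers = {"benign": (3.0, 1.0, 1.0), "malignant": (4.0, 8.0, 7.0)}
    xs, ys = [], []
    for label, center in centers.items():
        for _ in range(n_per_class):
            xs.append([max(1.0, c + rng.gauss(0.0, 0.5)) for c in center])
            ys.append(label)
    return xs, ys


if __name__ == "__main__":
    features, labels = make_toy_data()
    accuracy = cross_val_accuracy(features, labels, k_neighbors=5, folds=5)
    print(f"5-fold cross-validation accuracy on toy data: {accuracy:.3f}")
```

Because cosine distance compares only the direction of the feature vectors, two samples whose scores differ by a constant scale factor are treated as identical. Any accuracy this sketch prints reflects the toy data only and says nothing about the figures reported in the abstract.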