基于监督机器学习算法的简易癌症分类决策支持系统和web应用。

IF 2.5 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY Cancer Informatics Pub Date : 2023-01-01 DOI:10.1177/11769351221147244

K Chandrashekar, Anagha S Setlur, Adithya Sabhapathi C, Satyam Suresh Raiker, Satyam Singh, Vidya Niranjan

{"title":"基于监督机器学习算法的简易癌症分类决策支持系统和web应用。","authors":"K Chandrashekar, Anagha S Setlur, Adithya Sabhapathi C, Satyam Suresh Raiker, Satyam Singh, Vidya Niranjan","doi":"10.1177/11769351221147244","DOIUrl":null,"url":null,"abstract":"Using a decision support system (DSS) that classifies various cancers provides support to the clinicians/researchers to make better decisions that can aid in early cancer diagnosis, thereby reducing chances of incorrect disease diagnosis. Thus, this work aimed at designing a classification model that can predict accurately for 5 different cancer types comprising of 20 cancer exomes, using the mutations identified from whole exome cancer analysis. Initially, a basic model was designed using supervised machine learning classification algorithms such as K-nearest neighbor (KNN), support vector machine (SVM), decision tree, naïve bayes and random forest (RF), among which decision tree and random forest performed better in terms of preliminary model accuracy. However, output predictions were incorrect due to less training scores. Thus, 16 essential features were then selected for model improvement using 2 approaches. All imbalanced datasets were balanced using SMOTE. In the first approach, all features from 20 cancer exome datasets were trained and models were designed using decision tree and random forest. Balanced datasets for decision tree model showed an accuracy of 77%, while with the RF model, the accuracy improved to 82% where all 5 cancer types were predicted correctly. Area under the curve for RF model was closer to 1, than decision tree model. In the second approach, all 15 datasets were trained, while 5 were tested. However, only 2 cancer types were predicted correctly. To cross validate RF model, Matthew's correlation co-efficient (MCC) test was performed. For method 1, the MCC test and MCC cross validation was found to be 0.7796 and 0.9356 respectively. Likewise, for second approach, MCC was observed to be 0.9365, corroborating the accuracy of the designed model. The model was successfully deployed using Streamlit as a web application for easy use. This study presents insights for allowing easy cancer classifications.","PeriodicalId":35418,"journal":{"name":"Cancer Informatics","volume":"22 ","pages":"11769351221147244"},"PeriodicalIF":2.5000,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/c2/da/10.1177_11769351221147244.PMC9880585.pdf","citationCount":"1","resultStr":"{\"title\":\"Decision Support System and Web-Application Using Supervised Machine Learning Algorithms for Easy Cancer Classifications.\",\"authors\":\"K Chandrashekar, Anagha S Setlur, Adithya Sabhapathi C, Satyam Suresh Raiker, Satyam Singh, Vidya Niranjan\",\"doi\":\"10.1177/11769351221147244\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Using a decision support system (DSS) that classifies various cancers provides support to the clinicians/researchers to make better decisions that can aid in early cancer diagnosis, thereby reducing chances of incorrect disease diagnosis. Thus, this work aimed at designing a classification model that can predict accurately for 5 different cancer types comprising of 20 cancer exomes, using the mutations identified from whole exome cancer analysis. Initially, a basic model was designed using supervised machine learning classification algorithms such as K-nearest neighbor (KNN), support vector machine (SVM), decision tree, naïve bayes and random forest (RF), among which decision tree and random forest performed better in terms of preliminary model accuracy. However, output predictions were incorrect due to less training scores. Thus, 16 essential features were then selected for model improvement using 2 approaches. All imbalanced datasets were balanced using SMOTE. In the first approach, all features from 20 cancer exome datasets were trained and models were designed using decision tree and random forest. Balanced datasets for decision tree model showed an accuracy of 77%, while with the RF model, the accuracy improved to 82% where all 5 cancer types were predicted correctly. Area under the curve for RF model was closer to 1, than decision tree model. In the second approach, all 15 datasets were trained, while 5 were tested. However, only 2 cancer types were predicted correctly. To cross validate RF model, Matthew's correlation co-efficient (MCC) test was performed. For method 1, the MCC test and MCC cross validation was found to be 0.7796 and 0.9356 respectively. Likewise, for second approach, MCC was observed to be 0.9365, corroborating the accuracy of the designed model. The model was successfully deployed using Streamlit as a web application for easy use. This study presents insights for allowing easy cancer classifications.\",\"PeriodicalId\":35418,\"journal\":{\"name\":\"Cancer Informatics\",\"volume\":\"22 \",\"pages\":\"11769351221147244\"},\"PeriodicalIF\":2.5000,\"publicationDate\":\"2023-01-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/c2/da/10.1177_11769351221147244.PMC9880585.pdf\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Cancer Informatics\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1177/11769351221147244\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"MATHEMATICAL & COMPUTATIONAL BIOLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Cancer Informatics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1177/11769351221147244","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"MATHEMATICAL & COMPUTATIONAL BIOLOGY","Score":null,"Total":0}

引用次数: 1

摘要

使用对各种癌症进行分类的决策支持系统(DSS)为临床医生/研究人员提供支持，帮助他们做出更好的决策，从而有助于早期癌症诊断，从而减少错误疾病诊断的机会。因此，本工作旨在设计一个分类模型，该模型可以使用从全外显子组癌症分析中鉴定的突变，准确预测由20个癌症外显子组组成的5种不同癌症类型。首先，使用k近邻(KNN)、支持向量机(SVM)、决策树、naïve贝叶斯和随机森林(RF)等监督机器学习分类算法设计基本模型，其中决策树和随机森林在模型初步精度上表现较好。然而，由于训练分数较少，输出预测是不正确的。因此，然后选择16个基本特征，使用2种方法进行模型改进。使用SMOTE对所有不平衡数据集进行平衡。在第一种方法中，对来自20个癌症外显子组数据集的所有特征进行训练，并使用决策树和随机森林设计模型。决策树模型的平衡数据集显示准确率为77%，而RF模型的准确率提高到82%，其中所有5种癌症类型都被正确预测。与决策树模型相比，射频模型的曲线下面积更接近于1。在第二种方法中，所有15个数据集都进行了训练，而5个数据集进行了测试。然而，只有两种癌症类型预测正确。为了交叉验证RF模型，采用Matthew’s相关系数检验(MCC)。方法1的MCC检验和MCC交叉验证分别为0.7796和0.9356。同样，对于第二种方法，观察到MCC为0.9365，证实了设计模型的准确性。该模型已成功部署，使用Streamlit作为web应用程序，方便使用。这项研究为简化癌症分类提供了见解。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

摘要图片

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Decision Support System and Web-Application Using Supervised Machine Learning Algorithms for Easy Cancer Classifications.

Using a decision support system (DSS) that classifies various cancers provides support to the clinicians/researchers to make better decisions that can aid in early cancer diagnosis, thereby reducing chances of incorrect disease diagnosis. Thus, this work aimed at designing a classification model that can predict accurately for 5 different cancer types comprising of 20 cancer exomes, using the mutations identified from whole exome cancer analysis. Initially, a basic model was designed using supervised machine learning classification algorithms such as K-nearest neighbor (KNN), support vector machine (SVM), decision tree, naïve bayes and random forest (RF), among which decision tree and random forest performed better in terms of preliminary model accuracy. However, output predictions were incorrect due to less training scores. Thus, 16 essential features were then selected for model improvement using 2 approaches. All imbalanced datasets were balanced using SMOTE. In the first approach, all features from 20 cancer exome datasets were trained and models were designed using decision tree and random forest. Balanced datasets for decision tree model showed an accuracy of 77%, while with the RF model, the accuracy improved to 82% where all 5 cancer types were predicted correctly. Area under the curve for RF model was closer to 1, than decision tree model. In the second approach, all 15 datasets were trained, while 5 were tested. However, only 2 cancer types were predicted correctly. To cross validate RF model, Matthew's correlation co-efficient (MCC) test was performed. For method 1, the MCC test and MCC cross validation was found to be 0.7796 and 0.9356 respectively. Likewise, for second approach, MCC was observed to be 0.9365, corroborating the accuracy of the designed model. The model was successfully deployed using Streamlit as a web application for easy use. This study presents insights for allowing easy cancer classifications.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Cancer Informatics Medicine-Oncology

CiteScore

3.00

自引率

5.00%

发文量

审稿时长

8 weeks

期刊介绍： The field of cancer research relies on advances in many other disciplines, including omics technology, mass spectrometry, radio imaging, computer science, and biostatistics. Cancer Informatics provides open access to peer-reviewed high-quality manuscripts reporting bioinformatics analysis of molecular genetics and/or clinical data pertaining to cancer, emphasizing the use of machine learning, artificial intelligence, statistical algorithms, advanced imaging techniques, data visualization, and high-throughput technologies. As the leading journal dedicated exclusively to the report of the use of computational methods in cancer research and practice, Cancer Informatics leverages methodological improvements in systems biology, genomics, proteomics, metabolomics, and molecular biochemistry into the fields of cancer detection, treatment, classification, risk-prediction, prevention, outcome, and modeling.