Knowledge Discovery from Confusion Matrix of Pruned CART in Imbalanced Microarray Data Ovarian Cancer Classification

Ni Kadek Emik Sapitri, Umu Sa’adah, Nur Shofianah
Scientific Journal of Informatics. Published 2024-02-29. DOI: 10.15294/sji.v11i1.50077 (https://doi.org/10.15294/sji.v11i1.50077)

Abstract

Purpose: The results of microarray data analysis are important in cancer diagnosis, especially for cancers that are asymptomatic in their early stages, such as ovarian cancer. One of the challenges in analyzing microarray data is class imbalance. Unfortunately, research on cancer classification from microarray data often ignores this challenge and therefore does not use appropriate evaluation metrics, which biases the results toward the majority class. This study uses a popular evaluation metric, accuracy, alongside a metric suited to imbalanced data, balanced accuracy (BA), to draw information from the confusion matrix about accuracy and BA values in ovarian cancer classification.

Methods: This study uses Classification and Regression Tree (CART) as the classifier. CART is optimized by pruning. The optimal CART is determined from the results of CART complexity analysis and the confusion matrix.

Results: The confusion matrix and CART interpretations in this research show that a CART with low complexity is still able to predict majority-class respondents well. However, when none of the data in the minority class was classified correctly, accuracy was still quite high, at 86.97% and 88.03% in the training and testing stages respectively, while BA in both stages was only 50%.

Novelty: It is very important to ensure that the evaluation metrics used match the characteristics of the data being processed. This research illustrates the difference between accuracy and BA. It concludes that classification of an imbalanced dataset without resampling can use BA as the evaluation metric because, based on the results, BA is fairer to both classes.
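
The gap between accuracy and BA reported above can be reproduced with a small worked example. The sketch below is illustrative only: the counts (87 majority-class and 13 minority-class samples) are hypothetical and are not the study's data, and treating the minority class as the "positive" class is an assumption. It uses the standard definitions, where accuracy is the share of all correct predictions and BA is the mean of sensitivity and specificity.

```python
def accuracy(tp, fn, fp, tn):
    """Fraction of all samples that were classified correctly."""
    return (tp + tn) / (tp + fn + fp + tn)


def balanced_accuracy(tp, fn, fp, tn):
    """Mean of per-class recall: (sensitivity + specificity) / 2."""
    sensitivity = tp / (tp + fn)  # recall on the positive (minority) class
    specificity = tn / (tn + fp)  # recall on the negative (majority) class
    return (sensitivity + specificity) / 2


# Hypothetical confusion matrix (NOT the paper's counts):
# the classifier predicts the majority class for every sample.
tp, fn = 0, 13   # minority class: all 13 samples misclassified
fp, tn = 0, 87   # majority class: all 87 samples classified correctly

acc = accuracy(tp, fn, fp, tn)
ba = balanced_accuracy(tp, fn, fp, tn)

print(f"accuracy = {acc:.2%}, balanced accuracy = {ba:.2%}")
# prints "accuracy = 87.00%, balanced accuracy = 50.00%"
```

With every minority-class sample misclassified, accuracy simply mirrors the majority-class share of the data, while BA falls to 50%, the same pattern the Results describe.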