Authors: Ni Kadek Emik Sapitri, Umu Sa’adah, Nur Shofianah
Journal: Scientific Journal of Informatics
Publication date: 2024-02-29
DOI: 10.15294/sji.v11i1.50077 (https://doi.org/10.15294/sji.v11i1.50077)
Citations: 0
Abstract
Knowledge Discovery from Confusion Matrix of Pruned CART in Imbalanced Microarray Data Ovarian Cancer Classification
Purpose: The results of microarray data analysis are important in cancer diagnosis, especially for cancers that are asymptomatic in their early stages, such as ovarian cancer. One of the challenges in analyzing microarray data is class imbalance. Unfortunately, studies that classify cancer from microarray data often ignore this challenge and therefore do not use appropriate evaluation metrics, which biases the results towards the majority class. This study uses a popular evaluation metric, “accuracy”, and a metric suited to imbalanced data, “balanced accuracy (BA)”, to extract information from the confusion matrix about accuracy and BA values in ovarian cancer classification.
Methods: This study uses Classification and Regression Tree (CART) as the classifier. CART is optimized by pruning. The optimal CART is determined from the results of the CART complexity analysis and the confusion matrix.
Results: The confusion matrix and CART interpretations show that a low-complexity CART is still able to predict majority-class respondents well. However, when none of the minority-class data was classified correctly, accuracy was still quite high, at 86.97% and 88.03% in the training and testing stages respectively, while BA was only 50% in both stages.
Novelty: It is very important to ensure that the evaluation metrics match the characteristics of the data being processed. This research illustrates the difference between accuracy and BA. It concludes that classification of an imbalanced dataset without resampling can use BA as the evaluation metric because, based on the results, BA is fairer to both classes.
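The contrast between accuracy and BA described in the Results can be sketched from a two-class confusion matrix. The snippet below is a minimal illustration, not the authors' code: the function names and the counts (87 majority-class and 13 minority-class samples) are hypothetical and are not taken from the paper's dataset. It assumes the usual definition of BA as the mean of the per-class recalls.

```python
def accuracy(minority_correct, minority_wrong, majority_wrong, majority_correct):
    """Share of all samples that were classified correctly."""
    total = minority_correct + minority_wrong + majority_wrong + majority_correct
    return (minority_correct + majority_correct) / total


def balanced_accuracy(minority_correct, minority_wrong, majority_wrong, majority_correct):
    """Mean of the recall on the minority class and the recall on the majority class."""
    minority_recall = minority_correct / (minority_correct + minority_wrong)
    majority_recall = majority_correct / (majority_correct + majority_wrong)
    return (minority_recall + majority_recall) / 2


# Hypothetical confusion matrix (NOT the paper's data): a heavily pruned tree
# that assigns every sample to the majority class.
counts = dict(minority_correct=0, minority_wrong=13,
              majority_wrong=0, majority_correct=87)

print(accuracy(**counts))           # 0.87
print(balanced_accuracy(**counts))  # 0.5
```

With these illustrative counts, accuracy stays high because it is dominated by the majority class, while BA falls to 0.5 because the minority-class recall is zero. This is the same pattern the abstract reports.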