A Novel Over-Sampling Method and its Application to Cancer Classification from Gene Expression Data

Chem-Bio Informatics Journal, pp. 19-29 | IF 0.8 | Q4 (Biochemistry & Molecular Biology) | Pub Date: 2013-01-01 | DOI: 10.1273/CBIJ.13.19

Xuan Tho Dang, Osamu Hirose, Duong Hung Bui, Thammakorn Saethang, Vu Anh Tran, L. A. T. Nguyen, T. K. T. Le, Mamoru Kubo, Yoichi Yamada, K. Satou


Citations: 8

Abstract



A Novel Over-Sampling Method and its Application to Cancer Classification from Gene Expression Data

One of the most critical and frequent problems in biomedical data classification is imbalanced class distribution, where samples from the majority class significantly outnumber those from the minority class. SMOTE is a well-known general over-sampling method used to address this problem; however, in some cases it fails to improve, or even reduces, classification performance. To address this issue, we have developed a novel minority over-sampling method named safe-SMOTE. Experimental results on two gene expression datasets for cancer classification (colon-cancer and leukemia) and six imbalanced benchmark datasets from the UCI Machine Learning Repository showed that our method achieved better sensitivity and G-mean values than both the control method (i.e., no over-sampling) and SMOTE. For example, in the colon-cancer dataset, the sensitivity and specificity achieved by SMOTE (81.36% and 88.63%) were lower than those of the control method (81.59% and 89.50%), whereas safe-SMOTE increased both values (81.82% and 90.50%). Similarly, the G-mean value of the control (85.45%) decreased to 84.91% when SMOTE was employed, but increased to 86.04% with safe-SMOTE. In the leukemia dataset, SMOTE improved the sensitivity and G-mean values relative to the control; however, safe-SMOTE achieved noticeable and even greater improvements on both criteria.
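The two ideas the abstract relies on can be illustrated with a short Python sketch. It is not the authors' code, and it does not implement safe-SMOTE, whose mechanism the abstract does not describe. The function `smote_sketch` (a hypothetical name) shows only the standard SMOTE idea of creating a synthetic minority sample by interpolating between a minority sample and one of its nearest minority neighbours. The function `g_mean` assumes the conventional definition of G-mean as the geometric mean of sensitivity and specificity, which the abstract does not spell out.

```python
import math
import random


def g_mean(sensitivity, specificity):
    """Geometric mean of sensitivity and specificity (fractions in [0, 1]).

    Assumption: this is the conventional G-mean definition; the abstract
    does not state the formula explicitly.
    """
    return math.sqrt(sensitivity * specificity)


def smote_sketch(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points by SMOTE-style interpolation.

    minority: list of equal-length numeric feature lists (minority class only).
    This illustrates plain SMOTE, not the safe-SMOTE method of the paper.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot use more neighbours than exist
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the base sample, excluding itself
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Colon-cancer sensitivity/specificity pairs quoted in the abstract.
reported = [
    ("control", 0.8159, 0.8950),
    ("SMOTE", 0.8136, 0.8863),
    ("safe-SMOTE", 0.8182, 0.9050),
]
for name, sens, spec in reported:
    print(f"{name}: G-mean = {100 * g_mean(sens, spec):.2f}%")
```

Under this definition, the G-mean values computed from the quoted sensitivity and specificity agree with the reported 85.45%, 84.91%, and 86.04% to within about 0.01 percentage points. The small residual gap would be expected if the published figures are averages over repeated runs, which is an assumption here rather than something the abstract states.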


Source journal

Chem-Bio Informatics Journal (BIOCHEMISTRY & MOLECULAR BIOLOGY)

CiteScore

0.60

Self-citation rate

0.00%

Number of publications