A Novel Over-Sampling Method and its Application to Cancer Classification from Gene Expression Data

Chem-Bio Informatics Journal (IF 0.4, Q4, Biochemistry & Molecular Biology) | Pub Date: 2013-01-01 | DOI: 10.1273/CBIJ.13.19
Xuan Tho Dang, Osamu Hirose, Duong Hung Bui, Thammakorn Saethang, Vu Anh Tran, L. A. T. Nguyen, T. K. T. Le, Mamoru Kubo, Yoichi Yamada, K. Satou
Citations: 8

Abstract

One of the most critical and frequent problems in biomedical data classification is imbalanced class distribution, where samples from the majority class significantly outnumber those from the minority class. SMOTE is a well-known general over-sampling method used to address this problem; however, in some cases it fails to improve classification performance or even reduces it. To address this issue, we have developed a novel minority over-sampling method named safe-SMOTE. Experimental results on two gene expression datasets for cancer classification (colon-cancer and leukemia) and six imbalanced benchmark datasets from the UCI Machine Learning Repository showed that our method achieved better sensitivity and G-mean values than both the control method (no over-sampling) and SMOTE. For example, on the colon-cancer dataset, the sensitivity and specificity achieved by SMOTE (81.36% and 88.63%) were lower than those of the control method (81.59% and 89.50%), whereas safe-SMOTE increased both (81.82% and 90.50%). Similarly, the G-mean of the control (85.45%) decreased to 84.91% with SMOTE but increased to 86.04% with safe-SMOTE. On the leukemia dataset, SMOTE improved the sensitivity and G-mean relative to the control; however, safe-SMOTE achieved even greater improvements on both criteria.
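The abstract does not define the G-mean or describe how safe-SMOTE generates samples. As a rough illustration only, the sketch below assumes the usual definition of the G-mean (the geometric mean of sensitivity and specificity) and shows the interpolation step of standard SMOTE, not safe-SMOTE. The function names `g_mean` and `smote_sample`, the parameters `k`, `n_new`, and `seed`, and the toy data are all hypothetical and not taken from the paper.

```python
import math
import random


def g_mean(sensitivity, specificity):
    """Geometric mean of sensitivity and specificity, both given as fractions in [0, 1]."""
    return math.sqrt(sensitivity * specificity)


def smote_sample(minority, k=5, n_new=10, seed=0):
    """Generate n_new synthetic points by standard SMOTE interpolation.

    Assumes `minority` is a list of at least two equal-length numeric vectors.
    This is plain SMOTE; the safe-SMOTE rule is not described in the abstract.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # The k nearest minority neighbours of x (Euclidean distance), excluding x itself.
        others = [p for p in minority if p is not x]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        nb = rng.choice(neighbours)
        # Place the new point at a random position on the segment from x to nb.
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


# G-mean recomputed from the colon-cancer sensitivity/specificity pairs in the abstract.
reported = {
    "control": (0.8159, 0.8950),
    "SMOTE": (0.8136, 0.8863),
    "safe-SMOTE": (0.8182, 0.9050),
}
for name, (sens, spec) in reported.items():
    print(f"{name}: G-mean = {100 * g_mean(sens, spec):.2f}%")

# Toy minority class (made-up points) to exercise the interpolation step.
toy_minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote_sample(toy_minority, k=2, n_new=3)
```

Under this assumed definition, the recomputed values land within a few hundredths of a percentage point of the G-means quoted in the abstract. Small differences would be expected if the reported figures were averaged over repeated validation runs, which the abstract does not specify.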
Source journal: Chem-Bio Informatics Journal (Biochemistry & Molecular Biology)
CiteScore: 0.60 | Self-citation rate: 0.00% | Articles published: 8