The Effects of Over and Under Sampling on Fault-prone Module Detection

Yasutaka Kamei, Akito Monden, S. Matsumoto, Takeshi Kakimoto, Ken-ichi Matsumoto
First International Symposium on Empirical Software Engineering and Measurement (ESEM 2007). Published 2007-09-20. DOI: 10.1109/ESEM.2007.28 (https://doi.org/10.1109/ESEM.2007.28)
Citations: 148

Abstract

The goal of this paper is to improve the prediction performance of fault-prone module prediction models (fault-proneness models) by applying over- and under-sampling methods, which are preprocessing procedures for a fit dataset. Sampling methods are expected to improve prediction performance when the fit dataset is unbalanced, i.e., when the number of fault-prone modules differs greatly from the number of not-fault-prone modules. So far, no research has reported the effects of applying sampling methods to fault-proneness models. In this paper, we experimentally evaluated the effects of four sampling methods (random over sampling, synthetic minority over sampling, random under sampling, and one-sided selection) applied to four fault-proneness models (linear discriminant analysis, logistic regression analysis, neural network, and classification tree) using two module sets of industry legacy software. All four sampling methods improved the prediction performance of the linear and logistic models, while the neural network and classification tree models did not benefit from the sampling methods. The improvements in F1-values for the linear and logistic models were 0.078 at minimum, 0.224 at maximum, and 0.121 on average.
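
To make the two random sampling methods and the F1-value concrete, here is a minimal Python sketch. It is not the authors' implementation: the function names, the "fp"/"nfp" labels, the toy metric rows, and the choice to balance the two classes to exactly equal size are all assumptions made for illustration, since the abstract does not state the sampling ratio used. Synthetic minority over sampling and one-sided selection are not covered.

```python
import random
from collections import Counter


def _split_indices(labels, minority):
    """Return (minority indices, majority indices) for a two-class label list."""
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    return minority_idx, majority_idx


def random_over_sample(rows, labels, minority, seed=0):
    """Duplicate randomly chosen minority rows until both classes are equal in size.

    Assumption: balancing to a 1:1 ratio; the abstract does not give the ratio.
    """
    rng = random.Random(seed)
    minority_idx, majority_idx = _split_indices(labels, minority)
    shortfall = max(0, len(majority_idx) - len(minority_idx))
    extra = [rng.choice(minority_idx) for _ in range(shortfall)]
    keep = majority_idx + minority_idx + extra
    return [rows[i] for i in keep], [labels[i] for i in keep]


def random_under_sample(rows, labels, minority, seed=0):
    """Randomly discard majority rows until both classes are equal in size."""
    rng = random.Random(seed)
    minority_idx, majority_idx = _split_indices(labels, minority)
    n_keep = min(len(majority_idx), len(minority_idx))
    keep = rng.sample(majority_idx, n_keep) + minority_idx
    return [rows[i] for i in keep], [labels[i] for i in keep]


def f1_value(actual, predicted, positive):
    """Harmonic mean of precision and recall for the `positive` class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Made-up fit dataset: each row is a hypothetical [lines_of_code, complexity] pair.
    rows = [[120, 4], [300, 9], [80, 2], [95, 3], [60, 1],
            [150, 5], [70, 2], [110, 3], [90, 2], [130, 4]]
    labels = ["fp", "fp"] + ["nfp"] * 8  # 2 fault-prone, 8 not-fault-prone

    _, over_labels = random_over_sample(rows, labels, minority="fp")
    _, under_labels = random_under_sample(rows, labels, minority="fp")
    print(Counter(over_labels), Counter(under_labels))
```

Both functions would be applied to the fit dataset only, before a model is built; the test data used to compute the F1-value keeps its original class distribution.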