The Effects of Over and Under Sampling on Fault-prone Module Detection

First International Symposium on Empirical Software Engineering and Measurement (ESEM 2007). Pub Date: 2007-09-20. DOI: 10.1109/ESEM.2007.28

Yasutaka Kamei, Akito Monden, S. Matsumoto, Takeshi Kakimoto, Ken-ichi Matsumoto


Citations: 148

Abstract

The goal of this paper is to improve the prediction performance of fault-prone module prediction models (fault-proneness models) by employing over- and under-sampling methods, which are preprocessing procedures applied to a fit dataset. The sampling methods are expected to improve prediction performance when the fit dataset is unbalanced, i.e., when there is a large difference between the number of fault-prone modules and the number of not-fault-prone modules. So far, there has been no research reporting the effects of applying sampling methods to fault-proneness models. In this paper, we experimentally evaluated the effects of four sampling methods (random over sampling, synthetic minority over sampling, random under sampling, and one-sided selection) applied to four fault-proneness models (linear discriminant analysis, logistic regression analysis, neural network, and classification tree), using two module sets of industry legacy software. All four sampling methods improved the prediction performance of the linear and logistic models, while the neural network and classification tree models did not benefit from the sampling methods. The improvements in F1-values for the linear and logistic models were 0.078 at minimum, 0.224 at maximum, and 0.121 on average.
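As a rough illustration of the two simplest sampling methods named in the abstract and of the F1-value used to report the results, the sketch below uses only the Python standard library. It is not the authors' implementation. The function names, the label convention (1 = fault-prone, 0 = not fault-prone), and the choice to balance the classes to a 1:1 ratio are all assumptions made for this example; the abstract does not state the ratio the authors used.

```python
import random


def split_indices(labels, minority):
    """Return (minority_idx, majority_idx) for a list of class labels."""
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    return minority_idx, majority_idx


def random_over_sample(rows, labels, minority=1, seed=0):
    """Duplicate randomly chosen minority rows until the classes are equal in size.

    The 1:1 target ratio is an assumption of this sketch.
    """
    rng = random.Random(seed)
    minority_idx, majority_idx = split_indices(labels, minority)
    shortfall = max(0, len(majority_idx) - len(minority_idx))
    extra = [rng.choice(minority_idx) for _ in range(shortfall)]
    keep = majority_idx + minority_idx + extra
    return [rows[i] for i in keep], [labels[i] for i in keep]


def random_under_sample(rows, labels, minority=1, seed=0):
    """Discard randomly chosen majority rows until the classes are equal in size.

    The 1:1 target ratio is an assumption of this sketch.
    """
    rng = random.Random(seed)
    minority_idx, majority_idx = split_indices(labels, minority)
    n_keep = min(len(minority_idx), len(majority_idx))
    keep = rng.sample(majority_idx, n_keep) + minority_idx
    return [rows[i] for i in keep], [labels[i] for i in keep]


def f1_value(actual, predicted, positive=1):
    """Harmonic mean of precision and recall for the positive (fault-prone) class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    if tp == 0:
        return 0.0  # precision or recall is zero (or undefined), so F1 is taken as 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy fit dataset: 2 fault-prone modules, 8 not-fault-prone modules.
rows = [[i] for i in range(10)]
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

over_rows, over_labels = random_over_sample(rows, labels)
under_rows, under_labels = random_under_sample(rows, labels)
```

Sampling of this kind would be applied only to the fit dataset, before a model is trained; the data used to evaluate the model keeps its original class distribution. Synthetic minority over sampling and one-sided selection are more involved and are not sketched here.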


