An Under-sampling Algorithm Based on Weighted Complexity and Its Application in Software Defect Prediction
Authors: Wei Wei, Feng Jiang, Xu Yu, Junwei Du
Published in: Proceedings of the 2022 5th International Conference on Software Engineering and Information Management
Publication date: 2022-01-21
DOI: 10.1145/3520084.3520091 (https://doi.org/10.1145/3520084.3520091)
Citations: 1
Abstract
Under-sampling is an important technique for addressing class imbalance in software defect prediction. However, existing under-sampling methods generally ignore the fact that samples differ greatly in complexity. Sample complexity can play an important role in defect prediction, since the complexity of a sample is closely related to whether it contains defects. It is therefore necessary to take sample complexity into account when under-sampling is used to handle class imbalance in software defect prediction. In this paper, we propose the notion of weighted complexity, which takes the weights of the different condition attributes into account when computing the complexity of each sample. Based on weighted complexity, we propose a new under-sampling algorithm, called WCP-UnderSampler, and apply it to software defect prediction. WCP-UnderSampler proceeds in three steps. First, we employ the granularity decision entropy from rough set theory to calculate the significance and the weight of each condition attribute. Second, we obtain the weighted complexity of each sample as the weighted sum of the sample's values on all attributes. Third, we sort the majority class samples in descending order of weighted complexity and select those with the highest complexities until a balanced data set is obtained. Experiments on defect prediction data sets show that using WCP-UnderSampler to handle the imbalanced data yields better software defect prediction results.
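The second and third steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not define the granularity decision entropy, so the attribute weights are assumed to have been computed already and are passed in as a plain list. The function names `weighted_complexity` and `wcp_undersample`, the `majority_label` parameter, and the choice of label 0 for the majority (non-defective) class are all hypothetical, as is the assumption that attribute values are numeric and on comparable scales.

```python
from typing import List, Sequence, Tuple


def weighted_complexity(sample: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of a sample's attribute values (step 2 in the abstract)."""
    return sum(w * x for w, x in zip(weights, sample))


def wcp_undersample(
    X: List[Sequence[float]],
    y: List[int],
    weights: Sequence[float],
    majority_label: int = 0,  # assumption: 0 marks the majority class
) -> Tuple[List[Sequence[float]], List[int]]:
    """Keep all minority samples plus the most complex majority samples.

    `weights` stands in for the attribute weights that the paper derives
    from granularity decision entropy; that derivation is not shown here.
    """
    minority = [(x, label) for x, label in zip(X, y) if label != majority_label]
    majority = [(x, label) for x, label in zip(X, y) if label == majority_label]

    # Step 3: sort majority samples by weighted complexity, highest first.
    majority.sort(key=lambda pair: weighted_complexity(pair[0], weights), reverse=True)

    # Keep as many majority samples as there are minority samples.
    kept = majority[: len(minority)]

    balanced = minority + kept
    return [x for x, _ in balanced], [label for _, label in balanced]


if __name__ == "__main__":
    # Toy data: two attributes, label 1 = defective (minority), 0 = clean.
    X = [[1, 1], [5, 5], [2, 9], [10, 1], [3, 3]]
    y = [1, 0, 0, 0, 1]
    Xb, yb = wcp_undersample(X, y, weights=[0.25, 0.75])
    print(Xb, yb)
```

In the toy data, the three majority samples have weighted complexities 5.0, 7.25, and 3.25, so the two most complex ones are kept alongside the two minority samples. Passing the weights in as an argument keeps the sketch independent of how attribute significance is actually computed.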