{"title":"S2 LMMD: Cross-Project Software Defect Prediction via Statement Semantic Learning and Maximum Mean Discrepancy","authors":"Wangshu Liu, Yongteng Zhu, Xiang Chen, Qing Gu, Xingya Wang, Shenkai Gu","doi":"10.1109/APSEC53868.2021.00044","DOIUrl":null,"url":null,"abstract":"Different from within-project software defect prediction (WPDP), cross-project software defect prediction (CPDP) does not require sufficient training data and can help developers in the early stages of software development. Recent studies tried to learn semantic features for CPDP by feeding neural networks with abstract syntax tree (AST) token vectors. However, the ASTs directly parsed from software modules usually have complex structures, which are reflected on more nodes and deeper size, and the transfer learning is not regularly adopted to further reduce the data distribution difference between the source project and the target project. To solve these problems, we aim to joint learn the statement level trees (SLT) and alleviate data distribution difference with maximum mean discrepancy (MMD) to improve defect prediction performance on CPDP. Specifically, we propose a novel cross-project defect prediction method S2LMMD via statement semantic learning and MMD. We first construct the SLT by splitting the original AST on specified node. Then we generate more effective semantic features by learning of sequence embedding with Bi-GRU neural network. Finally, a transfer loss MMD is carried out to keep more common characteristics across different project datasets to further improve CPDP performance. To verify the effectiveness of our proposed method, we conducted experiments on ten widely used open-source projects and evaluated the experimental performance by using AUC measures. Our empirical results show that our proposed method S2LMMD can significantly outperform eight state-of-the-art baselines. In addition, for semantic learning, SLT has a higher influence on CPDP, while MMD is of great significance in transfer learning.","PeriodicalId":143800,"journal":{"name":"2021 28th Asia-Pacific Software Engineering Conference (APSEC)","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 28th Asia-Pacific Software Engineering Conference (APSEC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/APSEC53868.2021.00044","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2
Abstract
Different from within-project software defect prediction (WPDP), cross-project software defect prediction (CPDP) does not require sufficient training data and can help developers in the early stages of software development. Recent studies tried to learn semantic features for CPDP by feeding neural networks with abstract syntax tree (AST) token vectors. However, the ASTs directly parsed from software modules usually have complex structures, which are reflected on more nodes and deeper size, and the transfer learning is not regularly adopted to further reduce the data distribution difference between the source project and the target project. To solve these problems, we aim to joint learn the statement level trees (SLT) and alleviate data distribution difference with maximum mean discrepancy (MMD) to improve defect prediction performance on CPDP. Specifically, we propose a novel cross-project defect prediction method S2LMMD via statement semantic learning and MMD. We first construct the SLT by splitting the original AST on specified node. Then we generate more effective semantic features by learning of sequence embedding with Bi-GRU neural network. Finally, a transfer loss MMD is carried out to keep more common characteristics across different project datasets to further improve CPDP performance. To verify the effectiveness of our proposed method, we conducted experiments on ten widely used open-source projects and evaluated the experimental performance by using AUC measures. Our empirical results show that our proposed method S2LMMD can significantly outperform eight state-of-the-art baselines. In addition, for semantic learning, SLT has a higher influence on CPDP, while MMD is of great significance in transfer learning.