Siyu Jiang, Yuwen Chen, Zhenhang He, Yunpeng Shang, Le Ma
{"title":"通过语义和句法编码进行跨项目缺陷预测","authors":"Siyu Jiang, Yuwen Chen, Zhenhang He, Yunpeng Shang, Le Ma","doi":"10.1007/s10664-024-10495-z","DOIUrl":null,"url":null,"abstract":"<p>Cross-Project Defect Prediction (CPDP) is a promising research field that focuses on detecting defects in projects with limited labeled data by utilizing prediction models trained on projects with abundant data. However, previous CPDP approaches based on Abstract Syntax Tree (AST) have often encountered challenges in effectively acquiring semantic and syntactic information, resulting in their limited ability to combine both productively. This issue arises primarily from the practice of flattening the AST into a linear sequence in many AST-based methods, leading to the loss of hierarchical syntactic structure and structural information within the code. Besides, other AST-based methods use a recursive way to traverse the tree-structured AST, which is susceptible to gradient vanishing. To alleviate these concerns, we introduce a novel CPDP method named defect prediction via Semantic and Syntactic Encoding (SSE) that enhances Zhang’s approach by encoding semantic and syntactic information while retaining and considering AST structure. Specifically, we perform pre-training on a large corpus using a language model to learn semantic information. Next, we present a new rule for splitting AST into subtrees to avoid vanishing gradients. Then, the absolute paths originating from the root node and leading to the leaf nodes are encoded as hierarchical syntactic information. Finally, we design an encoder to integrate syntactic information into semantic information and leverage Bi-directional Long-Short Term Memory to learn the entire tree representation for prediction. Experimental results on 12 benchmark projects illustrate that the SSE method we proposed surpasses current state-of-the-art methods.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"26 1","pages":""},"PeriodicalIF":3.5000,"publicationDate":"2024-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Cross-project defect prediction via semantic and syntactic encoding\",\"authors\":\"Siyu Jiang, Yuwen Chen, Zhenhang He, Yunpeng Shang, Le Ma\",\"doi\":\"10.1007/s10664-024-10495-z\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>Cross-Project Defect Prediction (CPDP) is a promising research field that focuses on detecting defects in projects with limited labeled data by utilizing prediction models trained on projects with abundant data. However, previous CPDP approaches based on Abstract Syntax Tree (AST) have often encountered challenges in effectively acquiring semantic and syntactic information, resulting in their limited ability to combine both productively. This issue arises primarily from the practice of flattening the AST into a linear sequence in many AST-based methods, leading to the loss of hierarchical syntactic structure and structural information within the code. Besides, other AST-based methods use a recursive way to traverse the tree-structured AST, which is susceptible to gradient vanishing. To alleviate these concerns, we introduce a novel CPDP method named defect prediction via Semantic and Syntactic Encoding (SSE) that enhances Zhang’s approach by encoding semantic and syntactic information while retaining and considering AST structure. Specifically, we perform pre-training on a large corpus using a language model to learn semantic information. Next, we present a new rule for splitting AST into subtrees to avoid vanishing gradients. Then, the absolute paths originating from the root node and leading to the leaf nodes are encoded as hierarchical syntactic information. Finally, we design an encoder to integrate syntactic information into semantic information and leverage Bi-directional Long-Short Term Memory to learn the entire tree representation for prediction. Experimental results on 12 benchmark projects illustrate that the SSE method we proposed surpasses current state-of-the-art methods.</p>\",\"PeriodicalId\":11525,\"journal\":{\"name\":\"Empirical Software Engineering\",\"volume\":\"26 1\",\"pages\":\"\"},\"PeriodicalIF\":3.5000,\"publicationDate\":\"2024-06-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Empirical Software Engineering\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1007/s10664-024-10495-z\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, SOFTWARE ENGINEERING\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Empirical Software Engineering","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s10664-024-10495-z","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
Cross-project defect prediction via semantic and syntactic encoding
Cross-Project Defect Prediction (CPDP) is a promising research field that focuses on detecting defects in projects with limited labeled data by utilizing prediction models trained on projects with abundant data. However, previous CPDP approaches based on Abstract Syntax Tree (AST) have often encountered challenges in effectively acquiring semantic and syntactic information, resulting in their limited ability to combine both productively. This issue arises primarily from the practice of flattening the AST into a linear sequence in many AST-based methods, leading to the loss of hierarchical syntactic structure and structural information within the code. Besides, other AST-based methods use a recursive way to traverse the tree-structured AST, which is susceptible to gradient vanishing. To alleviate these concerns, we introduce a novel CPDP method named defect prediction via Semantic and Syntactic Encoding (SSE) that enhances Zhang’s approach by encoding semantic and syntactic information while retaining and considering AST structure. Specifically, we perform pre-training on a large corpus using a language model to learn semantic information. Next, we present a new rule for splitting AST into subtrees to avoid vanishing gradients. Then, the absolute paths originating from the root node and leading to the leaf nodes are encoded as hierarchical syntactic information. Finally, we design an encoder to integrate syntactic information into semantic information and leverage Bi-directional Long-Short Term Memory to learn the entire tree representation for prediction. Experimental results on 12 benchmark projects illustrate that the SSE method we proposed surpasses current state-of-the-art methods.
期刊介绍:
Empirical Software Engineering provides a forum for applied software engineering research with a strong empirical component, and a venue for publishing empirical results relevant to both researchers and practitioners. Empirical studies presented here usually involve the collection and analysis of data and experience that can be used to characterize, evaluate and reveal relationships between software development deliverables, practices, and technologies. Over time, it is expected that such empirical results will form a body of knowledge leading to widely accepted and well-formed theories.
The journal also offers industrial experience reports detailing the application of software technologies - processes, methods, or tools - and their effectiveness in industrial settings.
Empirical Software Engineering promotes the publication of industry-relevant research, to address the significant gap between research and practice.