Pub Date: 2025-12-04 | DOI: 10.1109/TSE.2025.3640123
Jiaxuan Han;Cheng Huang;Jiayong Liu;Tianwei Zhang
Containerization is mainstream in current software development, enabling software to be used across platforms without additional configuration of the running environment. However, many images created by developers are redundant, containing unnecessary code, packages, and components. This excess not only leads to bloated images that are cumbersome to transmit and store but also increases the attack surface, making them more vulnerable to security threats. Image slimming has therefore emerged as a significant area of interest. Nevertheless, existing image slimming technologies face challenges, particularly the incomplete extraction of the environment dependencies required by project code. In this paper, we present a novel image slimming model named δ-SCALPEL. This model employs static data dependency analysis to extract the environment dependencies of the project code and models the image’s file system with a directed graph called the command link directed graph. We select 30 NPM projects and two official Docker Hub images to construct a dataset for evaluating δ-SCALPEL. The evaluation results show that δ-SCALPEL is robust and can reduce image sizes by up to 61.4% while ensuring the normal operation of these projects.
δ-SCALPEL: Docker Image Slimming Based on Source Code Static Analysis. IEEE Transactions on Software Engineering, vol. 52, no. 2, pp. 562-577.
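The abstract leaves the internals of the command link directed graph unspecified; the following is a minimal sketch of the underlying idea only (keep files that the project’s entry commands transitively depend on, flag the rest as removal candidates), with all function and file names hypothetical:

```python
from collections import deque

def reachable_files(edges, entry_points):
    """Collect every node reachable from the entry commands via BFS.

    edges: dict mapping a node (command or file) to the files it depends on.
    entry_points: the commands the containerized project actually runs.
    """
    graph = {src: set(dsts) for src, dsts in edges.items()}
    seen = set(entry_points)
    queue = deque(entry_points)
    while queue:
        node = queue.popleft()
        for dep in graph.get(node, ()):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen

def removal_candidates(all_files, edges, entry_points):
    """Files in the image that no entry command (transitively) depends on."""
    return set(all_files) - reachable_files(edges, entry_points)
```

In this toy model, slimming an NPM image would mean deleting `removal_candidates(...)` from the file system; the paper’s actual dependency extraction via static data dependency analysis is considerably more involved.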
Pub Date: 2025-12-03 | DOI: 10.1109/TSE.2025.3637777
Shanquan Gao;Yihui Wang;Liyuan Tan;Zhenwei Ou;Xun Li
The Android platform provides a series of animation APIs with which app developers can improve the implementation efficiency of UI animations, specifically reducing the effort and time required to implement them. To assist app developers in quickly finding suitable animation APIs, we have proposed two recommendation models, Animation2API and U-A2A. Animation2API generates a list of available animation APIs for a UI animation task using a collaborative filtering algorithm. In contrast, U-A2A encodes both the animation API context and the UI animation task, and then predicts the next animation API for the current animation implementation based on the joint encoding of the two modalities. Since U-A2A can provide real-time recommendations throughout the process of animation implementation, it is effective in assisting developers in using animation API resources. Nevertheless, U-A2A has three key limitations. First, its GRU encoder for the animation API context struggles to capture long-distance dependencies and global information. Second, its 3D CNN encoder for the UI animation task fails to effectively extract the long-distance dependencies between video frames and the spatiotemporal features at different scales. Third, U-A2A treats the two modalities equally when fusing their encodings, despite the need to adaptively adjust their contribution levels according to the actual situation. To address these limitations, this paper introduces a novel animation API recommendation model named AC2Next. AC2Next adopts an encoder component based on the self-attention mechanism to encode the animation API context and the UI animation task. Specifically, it uses a GRU with self-attention as the encoder of the animation API context and applies ViViT, a Transformer architecture with self-attention mechanisms, to encode the UI animation task.
Meanwhile, AC2Next utilizes an adaptive weight layer to assign appropriate weights to the animation API context and the UI animation task during information fusion. Experimental results show that AC2Next outperforms U-A2A at every stage of animation implementation. When considering 1, 3, 5, and 10 recommended animation APIs, AC2Next improves recommendation accuracy by 31.56%, 10.01%, 5.57%, and 3.34%, respectively, compared to U-A2A.
AC2Next: A Novel Model That Can Predict the Next Animation API by Fusing the Animation API Context and the UI Animation Task. IEEE Transactions on Software Engineering, vol. 52, no. 1, pp. 22-35.
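The adaptive weight layer can be illustrated in miniature: two scalar scores, standing in for the outputs of a learned scoring layer, are softmax-normalized into per-modality weights before the encodings are combined. A stdlib-only sketch under that assumption (the model’s actual layer is trained end to end; all names are hypothetical):

```python
import math

def adaptive_fuse(api_ctx_vec, ui_task_vec, score_ctx, score_task):
    """Fuse two modality encodings with softmax-normalized weights.

    score_ctx / score_task stand in for a learned scoring layer's outputs;
    the softmax turns them into weights that sum to 1.
    """
    m = max(score_ctx, score_task)            # subtract max for stability
    w_ctx = math.exp(score_ctx - m)
    w_task = math.exp(score_task - m)
    total = w_ctx + w_task
    w_ctx, w_task = w_ctx / total, w_task / total
    # element-wise weighted sum of the two encodings
    return [w_ctx * a + w_task * b for a, b in zip(api_ctx_vec, ui_task_vec)]
```

With equal scores the fusion degenerates to a plain average, which is essentially the fixed equal-weight treatment the paper identifies as U-A2A’s third limitation.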
Pub Date: 2025-12-02 | DOI: 10.1109/TSE.2025.3638998
Yuan Jiang;Shan Huang;Christoph Treude;Xiaohong Su;Tiantian Wang
Vulnerability detection is critical for ensuring software security. Although deep learning (DL) methods, particularly those employing large language models (LLMs), have shown strong performance in automating vulnerability identification, they remain susceptible to adversarial examples: carefully crafted inputs with subtle perturbations designed to evade detection. Existing adversarial attack methods often require access to model architectures or confidence scores, making them impractical for real-world black-box systems. In this paper, we propose SVulAttack, a novel label-only adversarial attack framework targeting LLM-based vulnerability detectors. Our key innovation lies in a similarity-based strategy that estimates statement importance and model confidence, thereby enabling more effective selection of semantic-preserving code perturbations. SVulAttack combines this strategy with a transformation component and a search component, based on either greedy or genetic algorithms, to effectively identify and apply optimal combinations of transformations. We evaluate SVulAttack on open-source models (LineVul, StagedVulBERT, Code Llama, Deepseek-Coder) and closed-source models (GPT-5 nano, GPT-4o, GPT-4o-mini, Claude Sonnet 4). Results show that SVulAttack significantly outperforms existing label-only black-box attack methods. For example, against LineVul, our method with the genetic algorithm achieves an attack success rate of 49.0%, improving over DIP and CODA by 150.0% and 240.3%, respectively.
Shield Broken: Black-Box Adversarial Attacks on LLM-Based Vulnerability Detectors. IEEE Transactions on Software Engineering, vol. 52, no. 1, pp. 246-265.
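The label-only setting can be sketched as a greedy loop that stacks semantic-preserving transformations and queries nothing but the predicted label. This toy version deliberately omits SVulAttack’s similarity-based importance and confidence estimation; the detector and transformations below are hypothetical stand-ins:

```python
def greedy_label_only_attack(code, transformations, predict_label, vulnerable=1):
    """Apply semantic-preserving edits one by one until the label flips.

    predict_label is the only interface to the detector (label-only,
    black-box): it returns a class label, never a confidence score.
    Returns (final_code, attack_succeeded).
    """
    current = code
    for transform in transformations:
        current = transform(current)          # keep each edit cumulatively
        if predict_label(current) != vulnerable:
            return current, True              # detector no longer flags it
    return current, False
```

A real attack would order and select transformations by their estimated effect, which is exactly the gap the paper’s similarity-based strategy is designed to fill.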
Pub Date: 2025-11-27 | DOI: 10.1109/TSE.2025.3637335
Shuo Liu;Jacky Keung;Zhi Jin;Zhen Yang;Fang Liu;Hao Zhang
Compared to Full-Model Fine-Tuning (FMFT), Parameter-Efficient Fine-Tuning (PEFT) has demonstrated superior efficacy and efficiency in several code understanding tasks, owing to PEFT’s ability to alleviate the catastrophic forgetting issue of Pre-trained Language Models (PLMs) by updating only a small number of parameters. However, existing studies primarily involve static code comprehension, which aligns with the pre-training paradigm of recent PLMs and facilitates knowledge transfer but does not account for dynamic code changes. Thus, it remains unclear whether PEFT outperforms FMFT in task-specific adaptation for code-change-related tasks. To address this question, we examine four prevalent PEFT methods (i.e., AT, LoRA, PT, and PreT) and compare their performance with FMFT across seven popular PLMs. The experiments involve two widely studied code-change-related tasks, Just-In-Time Defect Prediction (JIT-DP) and Commit Message Generation (CMG), and demonstrate that the four PEFT methods can surpass FMFT on JIT-DP but, in common scenarios, achieve at best comparable performance on CMG; in cross-lingual and low-resource scenarios, however, they exhibit a relative advantage. We then conduct a series of probing tasks from both static and dynamic perspectives, offering detailed explanations for the efficacy of PEFT and FMFT. Inspired by the distinctive advantages of PEFT and FMFT in the layer-wise probing results, we propose Pasta_K, a self-adaptive efficient layer-specific tuning framework for PLMs in code change learning, which combines FMFT and PEFT during domain adaptation according to the guidance of the probing results. Experiments on the CMG task demonstrate that Pasta_K surpasses diverse PEFT methods in effectiveness. Moreover, Pasta_K outperforms FMFT by up to 1.48%, 3.21%, and 1.87% in terms of BLEU, Meteor, and Rouge-L, while saving 26.26% of training time and 20.65% of computational memory compared with FMFT.
An Empirical Study of Parameter-Efficient Fine-Tuning in Code Change Learning and Beyond. IEEE Transactions on Software Engineering, vol. 52, no. 1, pp. 3-21.
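Of the PEFT methods compared, LoRA is the easiest to illustrate: the frozen weight W is augmented with a low-rank update scaled by alpha/r, so only the small matrices A and B are trained instead of all of W. A dependency-free sketch of that update rule, not the paper’s implementation:

```python
def matmul(A, B):
    """Plain list-of-lists matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def lora_effective_weight(W, A, B, alpha):
    """Effective weight under LoRA: W + (alpha / r) * B @ A.

    W: frozen d_out x d_in weight matrix.
    B: d_out x r,  A: r x d_in  -- the only trainable parameters,
    r*(d_in + d_out) of them instead of d_in*d_out.
    """
    r = len(A)
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]
```

For a 768x768 attention projection with r=8, this trains roughly 12k parameters per matrix instead of ~590k, which is the parameter-efficiency the study measures against FMFT.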
Pub Date: 2025-11-24 | DOI: 10.1109/TSE.2025.3635158
Yifan An;Yunlong Ma;Xiang Gao;Hailong Sun
Code cloning is a common phenomenon in software development: it reduces developers’ programming effort but also poses risks of defect inheritance. Clone detection locates exact or similar pieces of code within or between software systems. With the amount of source code increasing steadily, efficient, large-scale clone detection has become a necessity. Moreover, code clones may occur at various levels of code granularity, e.g., file, function, and block level, posing further challenges for efficient clone detection. Although numerous methods have been proposed to detect code clones at different granularities, they often suffer from low detection efficiency and false positives, and are typically limited to identifying clones at a specific granularity. In this paper, we introduce an efficient clone detection approach named MGCD to detect code clones in large-scale codebases. Specifically, we embed function-level code into vectors using a pre-trained model and perform clustering search with the IVF_Flat algorithm to identify clone candidates. These candidates are then filtered through an entropy-based method to improve accuracy and avoid false positives. Moreover, we leverage the information from function-level clone detection results to further conduct file- and block-level clone detection. We evaluate our approach on the BigCloneBench benchmark. Experimental results show that our approach takes only 0.23 ms to search for clones among 800,000 functions and achieves high precision and recall.
Scalable Large-Scale Multi-Granularity Code Clone Detection via Clustering Search and Pre-Trained Models. IEEE Transactions on Software Engineering, vol. 52, no. 2, pp. 546-561.
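The IVF_Flat idea behind the clustering search can be sketched without any vector-search library: embeddings are bucketed by their nearest centroid, and a query scans only the closest bucket(s) rather than every vector. A stdlib-only toy version (real systems such as FAISS use trained centroids and optimized scans; names here are illustrative):

```python
import math

def build_ivf(vectors, centroids):
    """Assign each vector id to its nearest centroid (inverted lists)."""
    lists = {i: [] for i in range(len(centroids))}
    for vid, vec in enumerate(vectors):
        nearest = min(range(len(centroids)),
                      key=lambda i: math.dist(vec, centroids[i]))
        lists[nearest].append(vid)
    return lists

def ivf_search(query, vectors, centroids, lists, nprobe=1):
    """Scan only the nprobe closest clusters instead of all vectors."""
    order = sorted(range(len(centroids)),
                   key=lambda i: math.dist(query, centroids[i]))
    candidates = [vid for ci in order[:nprobe] for vid in lists[ci]]
    return min(candidates, key=lambda vid: math.dist(query, vectors[vid]))
```

This is what makes millisecond-scale search over hundreds of thousands of function embeddings plausible: the per-query cost scales with the probed clusters, not the whole codebase.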
Pub Date: 2025-11-24 | DOI: 10.1109/TSE.2025.3636150
Wei Ding;Ran Mo;Chaochao Wu;Haopeng Song;Hang Fu;Xinya Mu
Software architecture is the abstraction of a software system and significantly influences software development and maintenance. As software evolves, continuous changes can cause its architecture to deviate from the original design, leading to architecture degradation and a decline in software quality. Architecture refactoring becomes necessary to address or mitigate architecture degradation and improve overall quality. Although researchers have developed various architecture refactoring tools and techniques, there has been limited research on how architecture refactoring is practiced in real-world scenarios. In this paper, we conducted an empirical study analyzing posts from Stack Overflow to understand architecture refactoring in practice. Through our analysis of 694 posts with 3,468 discussion threads, we identified 12 types of architecture refactoring based on two classification dimensions. Additionally, we categorized the architecture problems faced by practitioners and explored their corresponding refactoring solutions. Furthermore, we revealed six potential risks that may result from architecture refactoring. We believe that our study can provide valuable insights for practitioners to perform architecture refactoring effectively. The findings can serve as a foundation for future research and offer practical guidance to improve architecture quality.
Exploring and Analyzing Software Architecture Refactoring in Practice. IEEE Transactions on Software Engineering, vol. 52, no. 1, pp. 286-303.
Pub Date: 2025-11-24 | DOI: 10.1109/TSE.2025.3635120
Baicai Sun;Lina Gong;Yinan Guo;Dunwei Gong;Gaige Wang
A target path of Message Passing Interface (MPI) programs typically consists of several target sub-paths. When using an intelligent optimization algorithm to generate a test case that covers the target path, we often find hard-to-cover target sub-paths that limit the testing efficiency of the entire target path. This paper therefore proposes a low-cost approach for path-coverage testing of MPI programs using surrogate-assisted changeable multi-objective optimization, further improving the effectiveness and efficiency of test case generation. The proposed approach first establishes a changeable multi-objective optimization model to guide the generation of test cases. While solving the changeable multi-objective optimization model with an intelligent optimization algorithm, we determine each hard-to-cover target sub-path and form a corresponding sample set. Finally, we manage the surrogate model corresponding to each hard-to-cover target sub-path based on the formed sample set, and select superior evolutionary individuals to actually execute the MPI program under test, thus reducing the cost and number of program executions. The proposed approach has been applied to path-coverage testing of several benchmark MPI programs and compared with several state-of-the-art approaches. Experimental results show that the proposed approach significantly improves the effectiveness and efficiency of generating test cases.
Low-Cost Testing for Path Coverage of MPI Programs Using Surrogate-Assisted Changeable Multi-Objective Optimization. IEEE Transactions on Software Engineering, vol. 52, no. 1, pp. 116-136.
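The surrogate-assisted step can be illustrated with a deliberately simple 1-nearest-neighbor surrogate: candidate test cases are ranked by fitness predicted from previously executed samples, and only the most promising few are really run against the (expensive) MPI program. All names are hypothetical and the paper’s surrogate management is far more sophisticated:

```python
import math

def surrogate_select(candidates, samples, expensive_eval, k=2):
    """Rank candidates with a 1-NN surrogate; really execute only the top k.

    candidates: numeric input vectors (candidate test cases).
    samples: (vector, fitness) pairs from earlier real executions;
             lower fitness is assumed better.
    expensive_eval: the costly real program run, called only k times.
    Returns the k (candidate, real_fitness) pairs that were executed.
    """
    def predicted(vec):
        # fitness of the nearest previously executed sample
        nearest = min(samples, key=lambda s: math.dist(vec, s[0]))
        return nearest[1]

    ranked = sorted(candidates, key=predicted)
    return [(c, expensive_eval(c)) for c in ranked[:k]]
```

The saving comes entirely from calling `expensive_eval` k times instead of once per candidate, which mirrors the paper’s goal of reducing the number of real program executions.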
Pub Date: 2025-11-20 | DOI: 10.1109/TSE.2025.3634192
Chunying Zhou;Xiaoyuan Xie;Gong Chen;Peng He;Bing Li
Issue assignment plays a critical role in open-source software (OSS) maintenance: it involves recommending the most suitable developers to address reported issues. Given the high volume of issue reports in large-scale projects, manually assigning issues is tedious and costly. Previous studies have proposed automated issue assignment approaches that primarily focus on modeling issue report textual information, developers’ expertise, or interactions between issues and developers based on historical issue-fixing records. However, these approaches often suffer from performance limitations due to incorrect and missing labels in OSS datasets, as well as the long tail of developer contributions and changes in developer activity as a project evolves. To address these challenges, we propose IssueCourier, a novel Multi-Relational Heterogeneous Temporal Graph Neural Network approach for issue assignment. Specifically, we formalize five key relationships among issues, developers, and source code files to construct a heterogeneous graph. We then adopt a temporal slicing technique that partitions the graph into a sequence of time-based subgraphs to learn stage-specific patterns. Furthermore, we provide a benchmark dataset with relabeled ground truth to address the problem of incorrect and missing labels in existing OSS datasets. Finally, to evaluate the performance of IssueCourier, we conduct extensive experiments on our benchmark dataset. The results show that IssueCourier improves over the best baseline by up to 45.49% in top-1 accuracy and 31.97% in MRR.
{"title":"IssueCourier: Multi-Relational Heterogeneous Temporal Graph Neural Network for Open-Source Issue Assignment","authors":"Chunying Zhou;Xiaoyuan Xie;Gong Chen;Peng He;Bing Li","doi":"10.1109/TSE.2025.3634192","DOIUrl":"10.1109/TSE.2025.3634192","url":null,"abstract":"Issue assignment plays a critical role in open-source software (OSS) maintenance, which involves recommending the most suitable developers to address the reported issues. Given the high volume of issue reports in large-scale projects, manually assigning issues is tedious and costly. Previous studies have proposed automated issue assignment approaches that primarily focus on modeling issue report textual information, developers’ expertise, or interactions between issues and developers based on historical issue-fixing records. However, these approaches often suffer from performance limitations due to the presence of incorrect and missing labels in OSS datasets, as well as the long tail of developer contributions and the changes in developer activity as the project evolves. To address these challenges, we propose IssueCourier, a novel Multi-Relational Heterogeneous Temporal Graph Neural Network approach for issue assignment. Specifically, we formalize five key relationships among issues, developers, and source code files to construct a heterogeneous graph. Then, we further adopt a temporal slicing technique that partitions the graph into a sequence of time-based subgraphs to learn stage-specific patterns. Furthermore, we provide a benchmark dataset with relabeled ground truth to address the problem of incorrect and missing labels in existing OSS datasets. Finally, to evaluate the performance of IssueCourier, we conduct extensive experiments on our benchmark dataset. 
The results show that IssueCourier can improve over the best baseline by up to 45.49% in top-1 and 31.97% in MRR.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"52 2","pages":"527-545"},"PeriodicalIF":5.6,"publicationDate":"2025-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145559535","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
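The temporal slicing step described in the abstract above, partitioning a timestamped graph into a sequence of time-based subgraphs, can be sketched as follows. This is an illustrative Python sketch under assumed conventions (edges as 4-tuples, a fixed window width), not IssueCourier's actual implementation:

```python
# Minimal sketch of temporal slicing: partition the timestamped edges of a
# heterogeneous graph into consecutive time-window subgraphs, so that a model
# can learn stage-specific patterns from each slice. The edge layout and
# window size here are illustrative assumptions.
from collections import defaultdict

def temporal_slices(edges, window):
    """edges: (src, relation, dst, timestamp) tuples; window: slice width.
    Returns a list of edge lists, one per consecutive time window."""
    if not edges:
        return []
    start = min(t for *_, t in edges)
    buckets = defaultdict(list)
    for src, rel, dst, t in edges:
        buckets[(t - start) // window].append((src, rel, dst, t))
    # Emit slices in chronological order, keeping empty windows in between
    last = max(buckets)
    return [buckets.get(i, []) for i in range(last + 1)]

edges = [
    ("dev1", "fixed", "issue9", 0),
    ("issue9", "touches", "file.py", 5),
    ("dev2", "commented", "issue9", 12),
]
slices = temporal_slices(edges, window=10)
print(len(slices))  # 2 slices: windows [0, 10) and [10, 20)
```

Each slice can then be fed to a graph neural network layer in sequence, which is the general pattern the abstract describes.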
Logic synthesis tools translate Hardware Description Language (HDL) designs into hardware implementations. To test these tools, numerous test cases are usually executed on them, yet only a few trigger faults, leading to inefficient testing. Since executing test cases on logic synthesis tools often incurs significant cost in complicated synthesis and simulation, fault-triggering test cases should be prioritized for execution. However, existing prioritization methods face challenges in accurately predicting the fault-triggering capability of dynamically generated test cases and in modeling the unique syntactic and structural complexities of these HDL-based programs. Therefore, we propose ProFuse, a multi-dimensional feature fusion method for logic synthesis tool test case prioritization. ProFuse leverages Abstract Syntax Trees (ASTs) and Data Flow Graphs (DFGs) to extract novel syntactic and structural features from HDL designs. These features are processed by a joint model of a Multilayer Perceptron (MLP) and a Graph Convolutional Network (GCN) to accurately rank fault-triggering test cases. ProFuse achieves an Average Percentage of Fault Detection (APFD) score of 0.9285, outperforming state-of-the-art prioritization methods by 11.38% to 82.49%. ProFuse can efficiently rank randomly generated test cases and has discovered 15 new faults in logic synthesis tools (i.e., Yosys and Vivado). The Vivado community acknowledged our work for improving their tool.
{"title":"ProFuse: Test Case Prioritization Based on Multi Dimensional Feature Fusion for Logic Synthesis Tools Testing Acceleration","authors":"Peiyu Zou;Xiaochen Li;Xu Zhao;Shikai Guo;Zhide Zhou;Yue Ma;He Jiang","doi":"10.1109/TSE.2025.3634318","DOIUrl":"10.1109/TSE.2025.3634318","url":null,"abstract":"Logic synthesis tools translate Hardware Description Language (HDL) designs into hardware implementation. To test these tools, numerous test cases are usually executed on the tools, yet only a few of them can trigger faults, leading to inefficient testing. Since executing test cases on logic synthesis tools often requires significant cost on complicated synthesis and simulation, fault-triggering test cases should be prioritized to execute. However, existing prioritization methods face challenges in accurately predicting the fault-triggering capability of dynamically generated test cases and modeling the unique syntactic and structure complexities of these HDL-based programs. Therefore, we propose ProFuse, a multi-dimensional feature fusion method for logic synthesis tool test case prioritization. ProFuse leverages Abstract Syntax Trees (AST) and Data Flow Graphs (DFG) to extract novel syntactic and structure features from HDL designs. These features are processed by a joint model of Multilayer Perceptron (MLP) and Graph Convolutional Network (GCN) to rank fault-triggering test cases accurately. ProFuse achieves an Average Percentage of Fault Detection (APFD) score of 0.9285, outperforming the state-of-the-art prioritization methods by 11.38% to 82.49%. ProFuse can efficiently rank randomly generated test cases to discover 15 new faults in logic synthesis tools (i.e., Yosys and Vivado). 
The Vivado community acknowledged our work for improving their tool.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"52 1","pages":"304-320"},"PeriodicalIF":5.6,"publicationDate":"2025-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145559522","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
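The APFD score reported for ProFuse is a standard prioritization metric rather than something specific to the tool. A minimal Python sketch of the textbook formula follows; the function name and data layout are illustrative, not taken from ProFuse:

```python
# Illustrative computation of the Average Percentage of Fault Detection
# (APFD) metric used to evaluate test case prioritization. This implements
# the standard formula APFD = 1 - (TF_1 + ... + TF_m) / (n * m) + 1 / (2n),
# where TF_j is the 1-based position of the first test that detects fault j.

def apfd(order, faults):
    """order: test case ids in prioritized execution order.
    faults: mapping fault_id -> set of test ids that trigger that fault.
    Assumes every fault is triggered by at least one test in `order`."""
    n = len(order)
    m = len(faults)
    position = {t: i + 1 for i, t in enumerate(order)}  # 1-based ranks
    tf_sum = sum(min(position[t] for t in triggers if t in position)
                 for triggers in faults.values())
    return 1 - tf_sum / (n * m) + 1 / (2 * n)

# An ordering that detects faults early scores higher.
faults = {"f1": {"t3"}, "f2": {"t1", "t3"}}
print(round(apfd(["t3", "t1", "t2"], faults), 4))  # 0.8333
print(round(apfd(["t2", "t1", "t3"], faults), 4))  # 0.3333
```

Values close to 1 mean faults are detected early in the run, which is why an APFD of 0.9285 indicates a strong ordering.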
Malicious code detection is one of the most critical tasks in safeguarding against security breaches, data compromise, and related threats. While machine learning has emerged as a predominant method for pattern detection, training is complicated by the severe scarcity of malicious code samples. Consequently, machine learning detectors often encounter malicious patterns in limited and isolated scenarios, hindering their ability to generalize effectively across diverse threat landscapes. In this paper, we introduce MalCoder, a novel method for synthesizing malicious code samples. MalCoder enlarges the quantity and diversity of malicious instances by transplanting a set of malicious prototypes into a vast pool of benign code, thereby crafting a diverse array of malicious instances tailored to various application scenarios. For each malware prototype, MalCoder treats it as an incomplete code fragment and crafts its preceding and subsequent contexts through right-to-left and left-to-right code completion, respectively. By leveraging GPTs with various sampling strategies, we can instantiate a large number of code samples bearing the malware prototype. Subsequently, MalCoder masks the original prototypes within the transplanted samples and fine-tunes an LLM code generator to reconstruct the original prototype. This process enables the model to seamlessly transplant malicious code fragments into benign code. During inference, MalCoder can automatically insert malicious fragments into benign samples at random positions, transforming benign code into malicious code. We apply MalCoder to a large pool of benign code in CodeSearchNet and craft over 50,000 malicious samples stemming from 39 malicious prototypes. Both qualitative and quantitative analyses show that the generated samples maintain key characteristics of malicious code while blending seamlessly with benign code, helping to create realistic and varied training data.
Additionally, using the generated samples as augmented training data yields a marked improvement in malicious code detection capability: the F1-score increases significantly compared with training on the original prototype samples alone.
{"title":"Synthetic Malware at Scale: Malicious Code Generation With Code Transplanting","authors":"Guangzhan Wang;Diwei Chen;Xiaodong Gu;Yuting Chen;Beijun Shen","doi":"10.1109/TSE.2025.3633280","DOIUrl":"10.1109/TSE.2025.3633280","url":null,"abstract":"Malicious code detection is one of the most essential tasks in safeguarding against security breaches, data compromise, and related threats. While machine learning has emerged as a predominant method for pattern detection, the training process is intricate due to the severe scarcity of malicious code samples. Consequently, machine learning detectors often encounter malicious patterns in limited and isolated scenarios, hindering their ability to generalize effectively across diverse threat landscapes. In this paper, we introduce MalCoder, a novel method for synthesizing malicious code samples. MalCoder enlarges the quantity and diversity of malicious instances by transplanting a set of malicious prototypes into a vast pool of benign code, thereby crafting a diverse array of malicious instances tailored to various application scenarios. For each malware prototype, MalCoder treats it as an incomplete code fragment and crafts its preceding and subsequent contexts through right-to-left and left-to-right code completion respectively. By leveraging GPTs with various sampling strategies, we can instantiate a large number of code samples bearing the malware prototype. Subsequently, MalCoder masks the original prototypes within the transplanted samples and fine-tunes an LLM code generator to reconstruct the original prototype. This process enables the model to seamlessly transplant malicious code fragments into benign code. During inference, MalCoder can automatically insert malicious fragments into benign samples at random positions, transforming benign code into malicious code. We apply MalCoder to a large pool of benign code in CodeSearchNet and craft over 50,000 malicious samples stemming from 39 malicious prototypes. 
Both qualitative and quantitative analyses show that the generated samples maintain key characteristics of malicious code while blending seamlessly with benign code, which helps in creating realistic and varied training data. Additionally, by using the generated samples as augmented training data, we witness a remarkable surge in malicious code detection capabilities. Specifically, the F1-score experiences a significant increase compared to utilizing only the original prototype samples.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"52 1","pages":"171-186"},"PeriodicalIF":5.6,"publicationDate":"2025-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145535261","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
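The final inference step the MalCoder abstract describes, inserting a malicious fragment into a benign sample at a random position, can be sketched roughly as below. This only illustrates the splice itself with a placeholder fragment; the actual pipeline uses a fine-tuned LLM to adapt the fragment to its surrounding context, and the function name is hypothetical:

```python
# Rough sketch of splicing a code fragment into benign source at a random
# line boundary. A placeholder comment stands in for the malicious fragment;
# this is an assumption-laden illustration, not MalCoder's implementation.
import random

def splice(benign_code: str, fragment: str, rng: random.Random) -> str:
    """Insert `fragment` into `benign_code` at a random line boundary."""
    lines = benign_code.splitlines()
    pos = rng.randrange(len(lines) + 1)  # any boundary, including both ends
    return "\n".join(lines[:pos] + fragment.splitlines() + lines[pos:])

rng = random.Random(0)  # seeded for reproducibility
benign = "def greet(name):\n    print(name)\n\ngreet('world')"
out = splice(benign, "# <fragment placeholder>", rng)
print("<fragment placeholder>" in out)  # True
```

In a real pipeline the insertion point and fragment wording would also need to respect the host language's syntax (e.g. indentation in Python), which is exactly the context-adaptation problem the learned generator addresses.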