MP: motion program synthesis with machine learning interpretability and knowledge graph analogy
Cheng-Hao Cai
Pub Date: 2025-02-18 | DOI: 10.1007/s10515-025-00495-8 | Automated Software Engineering, 32(1)
Open Access PDF: https://link.springer.com/content/pdf/10.1007/s10515-025-00495-8.pdf

The advancement of physics-based engines has led to the popularity of virtual reality. To achieve a more realistic and immersive user experience, the behaviours of objects in virtual scenes are expected to conform accurately to real-world physical laws, which increases the workload and development time for developers. To facilitate development on physics-based engines, this paper proposes MP, a motion program synthesis approach based on machine learning and analogical reasoning. MP follows the paradigm of test-driven development: programs are generated to fit test cases of motions subject to multiple environmental factors such as gravity and airflows. To reduce the search space of code generation, regression models identify the variables that significantly influence motions, while analogical reasoning on knowledge graphs finds operators that work for those variables. In addition, constraint solving probabilistically estimates the values of constants in motion programs. Experimental results demonstrate that MP is efficient across a variety of motion program generation tasks, with random forest regressors achieving low data and time requirements.
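The two search-space-reduction steps described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: influence is approximated by absolute correlation (a stand-in for regression-model importance), and the constant estimate is a plain least-squares fit (a stand-in for probabilistic constraint solving).

```python
def influence_scores(samples, outcomes):
    """Score each candidate variable by the absolute correlation of its
    column with the motion outcome; high-scoring variables are kept."""
    n = len(outcomes)
    mean_y = sum(outcomes) / n
    scores = []
    for j in range(len(samples[0])):
        col = [row[j] for row in samples]
        mean_x = sum(col) / n
        cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(col, outcomes))
        var_x = sum((x - mean_x) ** 2 for x in col)
        var_y = sum((y - mean_y) ** 2 for y in outcomes)
        denom = (var_x * var_y) ** 0.5
        scores.append(abs(cov / denom) if denom > 0 else 0.0)
    return scores

def estimate_constant(xs, ys):
    """Least-squares estimate of c in y = c * x, a simple analogue of
    estimating a constant in a motion program from test-case data."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
```

A variable that never changes across test cases (e.g., a fixed gravity constant) scores zero and can be pruned before code generation.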
LLM-enhanced evolutionary test generation for untyped languages
Ruofan Yang, Xianghua Xu, Ran Wang
Pub Date: 2025-02-17 | DOI: 10.1007/s10515-025-00496-7 | Automated Software Engineering, 32(1)

Dynamic programming languages, such as Python, are widely used for their flexibility and support for rapid development. However, the absence of explicit parameter type declarations poses significant challenges for automated test case generation: parameter types are often assigned at random, which enlarges the search space and reduces testing efficiency. Current evolutionary algorithms, which rely heavily on random mutations, struggle to handle specific data types and frequently fall into local optima, making it difficult to generate high-quality test cases. Moreover, the resulting test suites often contain errors, preventing immediate use in real-world applications. To address these challenges, this paper proposes using large language models to enhance test case generation for dynamic programming languages. Our method involves three key steps: analyzing parameter types to narrow the search space, introducing meaningful data during mutations to increase test case relevance, and using large language models to automatically repair errors in the generated test suites. Experimental results demonstrate a 16% improvement in test coverage, faster evolutionary cycles, and an increase in the number of executable test suites. These findings highlight the potential of large language models to improve both the efficiency and reliability of test case generation for dynamic programming languages.
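The first two steps, narrowing the search space via parameter types and keeping mutated data meaningful, can be illustrated with a minimal sketch. This is an assumption-laden toy, not the paper's tooling: types are inferred from previously observed argument values, and mutations stay within the inferred type.

```python
import random

def infer_type(observed_values):
    """Infer a parameter's type from observed argument values;
    return None when the observations disagree."""
    types = {type(v) for v in observed_values}
    return types.pop() if len(types) == 1 else None

def type_aware_mutate(value, rng):
    """Mutate a test input while preserving its type, instead of
    drawing a random value of a random type."""
    if isinstance(value, bool):   # bool first: bool is a subclass of int
        return not value
    if isinstance(value, int):
        return value + rng.choice([-1, 1])
    if isinstance(value, str):
        # extend the string with a character from its own alphabet,
        # a cheap way of keeping the mutated data plausible
        return value + rng.choice(value or "a")
    return value
```

Constraining mutations this way shrinks the search space the evolutionary algorithm must explore, which is the intuition behind the reported speed-up.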
Context-aware code summarization with multi-relational graph neural network
Yanlin Wang, Ensheng Shi, Lun Du, Xiaodi Yang, Yuxuan Hu, Yanli Wang, Daya Guo, Shi Han, Hongyu Zhang, Dongmei Zhang
Pub Date: 2025-02-06 | DOI: 10.1007/s10515-025-00490-z | Automated Software Engineering, 32(1)

Source code summaries are short natural language descriptions of code snippets that help developers better understand and maintain source code. There has been a surge of work on automatic code summarization to reduce the burden of writing summaries manually. However, contemporary approaches only leverage the information within the boundary of the method being summarized (i.e., the local context) and ignore the broader context that could assist with code summarization. This paper explores two global contexts, namely the intra-class and inter-class contexts, and proposes CoCoSUM: Context-Aware Code Summarization with Multi-Relational Graph Neural Network. CoCoSUM first incorporates class names as the intra-class context to generate class semantic embeddings. Then, relevant Unified Modeling Language (UML) class diagrams are extracted as the inter-class context and encoded into class relational embeddings using a novel Multi-Relational Graph Neural Network (MRGNN). The class semantic and class relational embeddings, together with the outputs of a code token encoder and an AST encoder, are passed to a decoder equipped with a two-level attention mechanism to generate high-quality, context-aware code summaries. Experimental results show that CoCoSUM outperforms state-of-the-art methods, and the global contexts adopted in CoCoSUM can also strengthen existing code summarization models. Our replication package is anonymously available at https://github.com/DeepSoftwareAnalytics/cocosum.
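The core idea of multi-relational message passing, aggregating neighbour information with relation-specific weights (e.g., different weights for UML "inherits" versus "aggregates" edges), can be sketched as follows. This is a minimal one-layer sketch under assumed per-relation scalar weights; the paper's MRGNN is considerably more elaborate.

```python
def mr_aggregate(node_vecs, edges, rel_weights):
    """One round of multi-relational aggregation.
    node_vecs: node -> embedding (list of floats)
    edges: list of (src, dst, relation)
    rel_weights: relation -> scalar weight (hypothetical parameters)."""
    out = {}
    for n in node_vecs:
        msgs = []
        for (s, d, r) in edges:
            if d == n:
                w = rel_weights[r]
                msgs.append([w * x for x in node_vecs[s]])
        if msgs:
            # each node receives the mean of its relation-weighted messages
            out[n] = [sum(col) / len(msgs) for col in zip(*msgs)]
        else:
            out[n] = list(node_vecs[n])  # no incoming edges: keep own vector
    return out
```

Stacking such layers lets a class embedding absorb information from classes several UML relations away.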
Enhancing multi-objective test case selection through the mutation operator
Miriam Ugarte, Pablo Valle, Miren Illarramendi, Aitor Arrieta
Pub Date: 2025-01-30 | DOI: 10.1007/s10515-025-00489-6 | Automated Software Engineering, 32(1)

Test case selection is a widely investigated technique for increasing the cost-effectiveness of software testing. Because the search space of this problem is huge, search-based approaches have proven effective: an optimization algorithm (e.g., a genetic algorithm) applies mutation and crossover operators guided by objective functions, with the goal of reducing test execution cost while maintaining overall test quality. The de facto mutation operator is bit-flip mutation, where each test case is mutated with a probability of 1/N, N being the total number of test cases in the original test suite. This has a core disadvantage: an effective test case and an ineffective one have the same probability of being selected or removed. In this paper, we advocate a novel mutation operator that promotes selecting cost-effective test cases while removing ineffective and expensive ones. To this end, instead of applying a probability of 1/N to every test case in the original suite, we calculate new selection and removal probabilities based on the adequacy criterion as well as the cost of each test case, determined before executing the algorithm (e.g., from historical data). We evaluate our approach on 13 case study systems, including 3 industrial case studies, across three application domains: cyber-physical systems (CPSs), continuous integration systems, and industrial control systems. Our results suggest that the proposed approach can increase the cost-effectiveness of search-based test case selection methods, especially when the time budget for executing test cases is low.
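The contrast with uniform 1/N mutation can be made concrete. In this sketch (an illustration of the idea, with an assumed adequacy/cost ratio rather than the paper's exact formulas), a test's selection probability grows with its adequacy per unit cost, and its removal probability with the inverse.

```python
def selection_probabilities(adequacy, cost):
    """Per-test selection probabilities proportional to adequacy/cost,
    so cheap, effective tests are favoured over a flat 1/N."""
    ratios = [a / c for a, c in zip(adequacy, cost)]
    total = sum(ratios)
    return [r / total for r in ratios]

def removal_probabilities(adequacy, cost):
    """Per-test removal probabilities proportional to cost/adequacy,
    so expensive, ineffective tests are removed first."""
    ratios = [c / a for a, c in zip(adequacy, cost)]
    total = sum(ratios)
    return [r / total for r in ratios]
```

Both distributions can be computed once, before the genetic algorithm runs, from historical adequacy and cost data.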
BadCodePrompt: backdoor attacks against prompt engineering of large language models for code generation
Yubin Qu, Song Huang, Yanzhou Li, Tongtong Bai, Xiang Chen, Xingya Wang, Long Li, Yongming Yao
Pub Date: 2025-01-28 | DOI: 10.1007/s10515-024-00485-2 | Automated Software Engineering, 32(1)

Using few-shot demonstrations in prompts significantly enhances the generation quality of large language models (LLMs), including for code generation. However, adversarial examples injected by malicious service providers via few-shot prompting pose a risk of backdoor attacks on LLMs, and no prior research has studied such attacks in the few-shot prompting setting for code generation tasks. In this paper, we propose BadCodePrompt, the first backdoor attack on code generation tasks targeting LLMs in the few-shot prompting scenario, which requires no access to training data or model parameters and incurs lower computational overhead. BadCodePrompt inserts triggers and poisonous code patterns into the demonstrations, causing poisonous source code to be emitted whenever a backdoor trigger appears in the end user's query prompt. We demonstrate the effectiveness of BadCodePrompt in backdoor attacks on three LLMs (GPT-4, Claude-3.5-Sonnet, and Gemini Pro-1.5) in code generation tasks without affecting the functionality of the generated code. LLMs with stronger reasoning capabilities are also more vulnerable to BadCodePrompt, with an average attack success rate of up to 98.53% for GPT-4 on two benchmark tasks. Finally, we apply state-of-the-art defenses against backdoor attacks in prompt engineering and show that they are largely ineffective against BadCodePrompt. BadCodePrompt therefore remains a serious threat to LLMs, underscoring the urgency of developing effective defense mechanisms.
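The headline metric, attack success rate (ASR), is simply the fraction of triggered queries whose generated code contains the attacker's payload. A generic sketch of the metric (not code from the paper, and the payload marker is a made-up example):

```python
def attack_success_rate(generated_programs, payload_marker):
    """Fraction of programs generated for trigger-bearing prompts that
    contain the attacker's payload pattern."""
    hits = sum(1 for code in generated_programs if payload_marker in code)
    return hits / len(generated_programs)
```

An ASR of 98.53%, as reported for GPT-4, means nearly every triggered prompt yielded poisoned output.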
RFMC-CS: a representation fusion based multi-view momentum contrastive learning framework for code search
Gong Chen, Wenjie Liu, Xiaoyuan Xie
Pub Date: 2025-01-27 | DOI: 10.1007/s10515-025-00487-8 | Automated Software Engineering, 32(1)

Code search is a crucial task in software engineering that aims to retrieve relevant code from a codebase given a natural language query. While deep-learning-based code search methods have demonstrated impressive performance, recent advances in contrastive learning have further enhanced the representation learning of these models. Despite these improvements, existing methods still have limitations in the representation learning of multi-modal data: they suffer from a semantic loss when representing code, and they fail to fully exploit functionally relevant code pairs during representation learning. To address these limitations, we propose a Representation Fusion based Multi-View Momentum Contrastive Learning Framework for Code Search, named RFMC-CS. RFMC-CS effectively retains the semantic and structural information of code through multi-modal representation and fusion. Through an elaborately designed multi-view momentum contrastive learning scheme, RFMC-CS further learns the correlations between different modalities of a sample and between semantically relevant samples. Experimental results on the CodeSearchNet benchmark show that RFMC-CS outperforms seven advanced baselines on the MRR and Recall@k metrics. Ablation experiments illustrate the effectiveness of each component, and portability experiments show that RFMC-CS has good portability.
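A standard ingredient of momentum contrastive learning is the slowly trailing key encoder, whose parameters are an exponential moving average of the query encoder's. The sketch below shows that update rule in isolation (a common MoCo-style formulation, shown as a generic illustration rather than RFMC-CS's exact scheme).

```python
def momentum_update(key_params, query_params, m=0.999):
    """key <- m * key + (1 - m) * query, applied element-wise.
    A large m keeps the key encoder smooth and stable, which is what
    makes a large queue of negative samples usable in contrastive loss."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]
```

After each training step, the key encoder drifts only slightly toward the query encoder, so previously encoded negatives stay approximately consistent.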
Large language model based mutations in genetic improvement
Alexander E. I. Brownlee, James Callan, Karine Even-Mendoza, Alina Geiger, Carol Hanna, Justyna Petke, Federica Sarro, Dominik Sobania
Pub Date: 2025-01-21 | DOI: 10.1007/s10515-024-00473-6 | Automated Software Engineering, 32(1)
Open Access PDF: https://link.springer.com/content/pdf/10.1007/s10515-024-00473-6.pdf

Ever since the first large language models (LLMs) became available, both academics and practitioners have used them to aid software engineering tasks. However, little research has yet combined search-based software engineering (SBSE) and LLMs. In this paper, we evaluate the use of LLMs as mutation operators for genetic improvement (GI), an SBSE approach, to improve the GI search process. In preliminary work, we explored the feasibility of combining the Gin Java GI toolkit with OpenAI LLMs to generate an edit for the JCodec tool. Here we extend this investigation to three LLMs, three types of prompt, and five real-world software projects. We sample the edits at random as well as using local search. As part of our evaluation, we also conducted a qualitative analysis to understand why LLM-generated code edits break. Our results show that, compared with conventional statement GI edits, LLMs produce fewer unique edits, but these compile and pass tests more often, with the OpenAI model finding test-passing edits 77% of the time. The OpenAI and Mistral LLMs are roughly equal at finding the best run-time improvements. Simpler prompts are more successful than those providing more context and examples. The qualitative analysis reveals a wide variety of ways in which LLMs fail to produce valid edits, commonly including inconsistent formatting, generating non-Java syntax, or refusing to provide a solution.
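The LLM-as-mutation-operator loop can be sketched abstractly: ask the model to rewrite one region of the program, splice the reply back, and keep the candidate only if it still passes its tests. Everything here is hypothetical scaffolding; `ask_llm` is a deterministic stub standing in for a real model call, and Gin's actual integration differs.

```python
def ask_llm(prompt):
    # Stub: a real system would query an LLM API here. This stand-in
    # "suggests" tightening a loop bound, mimicking a model reply.
    return prompt.replace("i < n", "i < n - 1")

def llm_mutate(source, region):
    """Use the (stubbed) model to rewrite one code region, then splice
    the reply back into the program text."""
    reply = ask_llm(region)
    return source.replace(region, reply)

def improve(source, region, passes_tests):
    """Keep an LLM edit only if the patched program still passes tests,
    mirroring GI's fitness check on each mutation."""
    candidate = llm_mutate(source, region)
    return candidate if passes_tests(candidate) else source
```

The paper's finding that LLM edits compile and pass tests more often than random statement edits suggests the model reply acts as a semantics-aware prior over this mutation space.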
Vulnerability detection with graph enhancement and global dependency representation learning
Xuehai Jia, Junwei Du, Minying Fang, Hao Liu, Yuying Li, Feng Jiang
Pub Date: 2025-01-05 | DOI: 10.1007/s10515-024-00484-3 | Automated Software Engineering, 32(1)

Vulnerability detection is essential for protecting software systems from attacks. Graph neural networks (GNNs) have proven effective in capturing semantic features of code and are widely used for this purpose. Existing GNN-based methods typically merge multiple graphs and employ GNNs to learn syntactic and semantic relationships within code graph structures. However, these methods face a significant limitation: current code graph structures inadequately represent parameter dependencies and node type information, which are crucial for capturing vulnerability patterns. This inadequacy hampers the GNNs' ability to discern and characterize vulnerable code, undermining effective vulnerability detection. Additionally, traditional GNN-based methods may lose long-distance dependency information during aggregation, which is vital for understanding the behavior and occurrence patterns of vulnerable code. Despite achieving state-of-the-art performance, existing GNN-based methods therefore struggle to fully understand vulnerability behaviors and their potential impacts. To address these issues, this paper introduces VulDecgre, a novel vulnerability detection model comprising two components: (1) an enhanced code graph structure that fuses multiple graphs and relational edges to improve code representation, and (2) a natural sequence-aware learning module that integrates code execution sequence information to enhance vulnerability detection. Extensive experiments on three public datasets and a self-collected large-scale real-world C/C++ dataset demonstrate that VulDecgre achieves superior performance in vulnerability detection.
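Fusing several code graphs into one relation-labelled structure, the starting point of such enhanced code graphs, can be sketched simply: keep every edge but tag it with the graph it came from. This is an illustrative construction, not VulDecgre's exact one, and the relation names are examples.

```python
def fuse_graphs(graphs):
    """graphs: mapping relation-name -> list of (src, dst) edges,
    e.g. separate AST, control-flow, and data-flow edge sets.
    Returns one multigraph as a list of (src, dst, relation) triples,
    so a downstream GNN can weight each relation differently."""
    fused = []
    for relation, edges in graphs.items():
        fused.extend((s, d, relation) for s, d in edges)
    return fused
```

Because the relation label survives fusion, a relational GNN layer can still distinguish, say, a data-flow dependency from mere syntactic adjacency.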
Pub Date : 2025-01-04DOI: 10.1007/s10515-024-00482-5
Qirong Bu, Xiangqiang Guo, Xia Sun, Jingjing Jiang, Xiaodi Zhao, Wang Zou, Xuxin Wang, Jianqiang Yan
Programming community-based question-and-answer websites, represented by Stack Overflow, are popular among programmers. Users post questions and share their knowledge and experience by answering. However, the accumulation of a large number of similar questions reduces the efficiency and quality of the community. To tackle this issue, related works utilize the complete textual information in question posts to detect question relatedness, but nearly all of them ignore the rich source code in the posts, which also complements the semantics of the questions. In this paper, we propose a bimodal framework for relatedness detection that combines text features and code features. Question pairs are encoded with a text pre-trained language model (e.g., SOBERT) and a code pre-trained language model (e.g., UniXcoder), respectively. Leveraging the powerful semantic modeling capabilities of pre-trained models, we obtain bimodal features that measure question similarity from both text and code perspectives. However, directly concatenating and fusing these features may be harmful because of the significant differences between them. To address this, we additionally leverage a cross-attention mechanism to derive supplementary features for correct feature fusion. Cross-attention captures semantic understanding from both modalities and integrates their representations. These supplementary features measure the semantic relationship between text-guided and code-guided features, effectively bridging the semantic gap. We conducted extensive experiments on two related datasets from the English and Chinese domains. The results show that our approach improves significantly over the baseline approaches, achieving strong performance in Macro-Precision, Macro-Recall and Macro-F1.
{"title":"Detecting question relatedness in programming Q&A communities via bimodal feature fusion","authors":"Qirong Bu, Xiangqiang Guo, Xia Sun, Jingjing Jiang, Xiaodi Zhao, Wang Zou, Xuxin Wang, Jianqiang Yan","doi":"10.1007/s10515-024-00482-5","DOIUrl":"10.1007/s10515-024-00482-5","url":null,"abstract":"<div><p>Programming community-based question and answering websites, represented by Stack Overflow, are popular among programmers. Users post questions and share their knowledge and experience through answering. Nonetheless, the accumulation of a large number of similar questions reduces the efficiency and quality of the community. To tackle this issue, related works utilize the complete textual information in the question posts for detecting question relatedness. But they almost all ignore the rich source code information in the posts, which also complements the semantics of the questions. In this paper, we propose a bimodal framework for relatedness detection based on the combination of text features and code features. Question pairs are encoded using a text pre-trained language model (e.g., SOBERT) and a code pre-trained language model (e.g., UniXcoder), respectively. With the powerful semantic modeling capabilities of pre-trained models, we obtain bimodal features that measure the similarity of questions from both text and code perspectives. However, directly concatenating and fusing these features may have a negative impact due to the significant differences between them. To address this, we additionally leverage the cross-attention mechanism to derive supplementary features of these bimodal features for the correct feature fusion. Cross-attention captures semantic understanding from both modalities, integrating their representations. These supplementary features measure the semantic relationship between text-guided and code-guided features, effectively bridging the semantic gap. 
We conducted extensive experiments on two related datasets from both the English and Chinese domains. The results show that our approach improves significantly over the baseline approaches, achieving advanced performance in the metrics of Macro-Precision, Macro-Recall and Macro-F1.</p></div>","PeriodicalId":55414,"journal":{"name":"Automated Software Engineering","volume":"32 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2025-01-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142925723","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
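The cross-attention fusion described above can be sketched in a few lines: text features attend over code features (and vice versa) before the guided representations are fused. This is a minimal single-head sketch with toy embeddings; the shapes, mean pooling, and fusion-by-concatenation are assumptions, not the authors' exact architecture.

```python
import numpy as np

def cross_attention(query_feats, key_feats):
    """Single-head cross-attention: rows of query_feats attend over key_feats.
    Shapes: (n_q, d) and (n_k, d); returns (n_q, d)."""
    d = query_feats.shape[-1]
    scores = query_feats @ key_feats.T / np.sqrt(d)       # (n_q, n_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over keys
    return weights @ key_feats                            # weighted key mixture

rng = np.random.default_rng(0)
text = rng.random((4, 8))   # e.g. SOBERT sentence embeddings (toy values)
code = rng.random((6, 8))   # e.g. UniXcoder token embeddings (toy values)

text_guided = cross_attention(text, code)  # text queries, code keys/values
code_guided = cross_attention(code, text)  # code queries, text keys/values
fused = np.concatenate([text_guided.mean(0), code_guided.mean(0)])  # (16,)
```

Because each modality's queries are answered by the other modality's keys, the supplementary features explicitly encode cross-modal relationships rather than relying on naive concatenation of the raw embeddings.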
With the rapid development of smart contract technology and the continuous expansion of blockchain application scenarios, the security of smart contracts has garnered significant attention. However, traditional fuzz testing typically relies on randomly generated initial seed sets. This random generation fails to understand the semantics of smart contracts, resulting in insufficient seed coverage. Additionally, traditional fuzz testing often ignores the syntactic and semantic constraints within smart contracts, leading to seeds that may not conform to the contracts' syntactic rules and may even include logic that violates contract semantics, thereby reducing the efficiency of fuzz testing. To address these challenges, we propose a method for adversarial generation of smart contract fuzz testing seeds guided by a Chain-Based LLM, leveraging the deep semantic understanding of LLMs to assist in seed set generation. First, we propose a method that uses Chain-Based prompts to request fuzz testing seeds from the LLM, breaking the LLM's task into multiple steps that gradually guide it towards generating high-coverage seed sets. Second, by establishing adversarial roles for the LLM, we guide it to autonomously generate and optimize seed sets, producing high-coverage initial seed sets for the program under test. To evaluate the effectiveness of the proposed method, 2308 smart contracts were crawled from Etherscan for experimental purposes. Results indicate that using Chain-Based prompts to request fuzz testing seed sets from the LLM improved instruction coverage by 2.94% compared to single-step requests. The method of generating seed sets by establishing adversarial roles for the LLM reduced the time to reach maximum instruction coverage from 60 s to approximately 30 s compared to single-role methods.
Additionally, the seed sets generated by the proposed method can directly trigger simple vulnerability types (e.g., timestamp dependency and block number dependency vulnerabilities), with instruction coverage improvements of 3.8% and 4.1%, respectively.
{"title":"Adversarial generation method for smart contract fuzz testing seeds guided by chain-based LLM","authors":"Jiaze Sun, Zhiqiang Yin, Hengshan Zhang, Xiang Chen, Wei Zheng","doi":"10.1007/s10515-024-00483-4","DOIUrl":"10.1007/s10515-024-00483-4","url":null,"abstract":"<div><p>With the rapid development of smart contract technology and the continuous expansion of blockchain application scenarios, the security issues of smart contracts have garnered significant attention. However, traditional fuzz testing typically relies on randomly generated initial seed sets. This random generation method fails to understand the semantics of smart contracts, resulting in insufficient seed coverage. Additionally, traditional fuzz testing often ignores the syntax and semantic constraints within smart contracts, leading to the generation of seeds that may not conform to the syntactic rules of the contracts and may even include logic that violates contract semantics, thereby reducing the efficiency of fuzz testing. To address these challenges, we propose a method for adversarial generation for smart contract fuzz testing seeds guided by Chain-Based LLM, leveraging the deep semantic understanding capabilities of LLM to assist in seed set generation. Firstly, we propose a method that utilizes Chain-Based prompts to request LLM to generate fuzz testing seeds, breaking down the LLM tasks into multiple steps to gradually guide the LLM in generating high-coverage seed sets. Secondly, by establishing adversarial roles for the LLM, we guide the LLM to autonomously generate and optimize seed sets, producing high-coverage initial seed sets for the program under test. To evaluate the effectiveness of the proposed method, 2308 smart contracts were crawled from Etherscan for experimental purposes. Results indicate that using Chain-Based prompts to request LLM to generate fuzz testing seed sets improved instruction coverage by 2.94% compared to single-step requests. 
The method of generating seed sets by establishing adversarial roles for the LLM reduced the time to reach maximum instruction coverage from 60 s to approximately 30 s compared to single-role methods. Additionally, the seed sets generated by the proposed method can directly trigger simple types of vulnerabilities (e.g., timestamp dependency and block number dependency vulnerabilities), with instruction coverage improvements of 3.8% and 4.1%, respectively.</p></div>","PeriodicalId":55414,"journal":{"name":"Automated Software Engineering","volume":"32 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142906108","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
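The chain-based prompting loop can be sketched as follows: the seed-generation task is broken into sequential prompts, with each step's response feeding the next step as context. The `call_llm` stub and the step wording below are hypothetical stand-ins for a real LLM API and the paper's actual prompts.

```python
# Hypothetical sketch of chain-based prompting for fuzzing-seed generation.
# call_llm is a stub; in practice it would wrap a real LLM API. The three
# step prompts are illustrative, not the paper's prompts.

def call_llm(prompt):
    """Stub standing in for a real LLM API call."""
    return f"[response to: {prompt[:30]}...]"

def chain_based_seed_generation(contract_source, steps):
    """Run each step prompt in order, threading the previous response
    through as context so the task is solved gradually, not in one shot."""
    context = contract_source
    for step in steps:
        context = call_llm(f"{step}\n\nContext:\n{context}")
    return context

steps = [
    "Step 1: Summarise the contract's state variables and functions.",
    "Step 2: List function call sequences likely to reach new branches.",
    "Step 3: Emit concrete argument values as fuzzing seeds.",
]
seeds = chain_based_seed_generation("contract Token { ... }", steps)
```

The adversarial-role variant described in the abstract could wrap this loop in a second, critic-role prompt that inspects the generated seeds and requests replacements for those that do not raise coverage; that refinement loop is omitted here for brevity.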