
arXiv - CS - Software Engineering: Latest Publications

Python Symbolic Execution with LLM-powered Code Generation
Pub Date : 2024-09-14 DOI: arxiv-2409.09271
Wenhan Wang, Kaibo Liu, An Ran Chen, Ge Li, Zhi Jin, Gang Huang, Lei Ma
Symbolic execution is a key technology in software testing, which generates test cases by collecting symbolic path constraints and then solving the constraints with SMT solvers. Symbolic execution has been proven helpful in generating high-coverage test cases, but its limitations, e.g., the difficulty of solving path constraints, prevent it from broader usage in software testing. Moreover, symbolic execution has encountered many difficulties when applied to dynamically typed languages like Python, because it is extremely challenging to translate the flexible Python grammar into rigid solvers. To overcome the main challenges of applying symbolic execution in Python, we propose an LLM-empowered agent, LLM-Sym, that automatically calls an SMT solver, Z3, to solve execution path constraints. Starting from an introductory-level symbolic execution engine, our LLM agent extends it to support programs with the complex data type `list`. The core contribution of LLM-Sym is translating complex Python path constraints into Z3 code. To enable accurate path-to-Z3 translation, we design a multiple-step code generation pipeline including type inference, retrieval, and self-refinement. Our experiments demonstrate that LLM-Sym is capable of solving path constraints on Leetcode problems with complicated control flows and list data structures, which is impossible for the backbone symbolic execution engine. Our approach paves the way for combining the generation ability of LLMs with the reasoning ability of symbolic solvers, and opens up new opportunities in LLM-augmented test case generation.
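As a toy illustration of what solving a list-typed path constraint involves (this is not LLM-Sym's actual Z3 pipeline; the branch condition and search domain below are invented, and brute-force search merely stands in for an SMT solver):

```python
from itertools import product

# To cover the branch below, a test input must satisfy both
# `len(xs) > 2` and `xs[0] + xs[1] == xs[2]` -- the kind of list
# constraint the paper translates into Z3 code.
def branch_taken(xs):
    return len(xs) > 2 and xs[0] + xs[1] == xs[2]

def solve_by_search(constraint, length=3, domain=range(0, 5)):
    """Brute-force stand-in for an SMT solver over small list inputs."""
    for candidate in product(domain, repeat=length):
        xs = list(candidate)
        if constraint(xs):
            return xs
    return None

model = solve_by_search(branch_taken)
print(model)  # -> [0, 0, 0], since 0 + 0 == 0 and the length is 3
```

With the real Z3 Python bindings, the same constraint would instead be encoded over solver variables (e.g., a sequence or array sort) and handed to `z3.Solver`; the point here is only the shape of the problem: collect a path constraint, then find a satisfying input.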
Citations: 0
Models Are Codes: Towards Measuring Malicious Code Poisoning Attacks on Pre-trained Model Hubs
Pub Date : 2024-09-14 DOI: arxiv-2409.09368
Jian Zhao, Shenao Wang, Yanjie Zhao, Xinyi Hou, Kailong Wang, Peiming Gao, Yuanchao Zhang, Chen Wei, Haoyu Wang
The proliferation of pre-trained models (PTMs) and datasets has led to the emergence of centralized model hubs like Hugging Face, which facilitate collaborative development and reuse. However, recent security reports have uncovered vulnerabilities and instances of malicious attacks within these platforms, highlighting growing security concerns. This paper presents the first systematic study of malicious code poisoning attacks on pre-trained model hubs, focusing on the Hugging Face platform. We conduct a comprehensive threat analysis, develop a taxonomy of model formats, and perform root cause analysis of vulnerable formats. While existing tools like Fickling and ModelScan offer some protection, they face limitations in semantic-level analysis and comprehensive threat detection. To address these challenges, we propose MalHug, an end-to-end pipeline tailored for Hugging Face that combines dataset loading script extraction, model deserialization, in-depth taint analysis, and heuristic pattern matching to detect and classify malicious code poisoning attacks in datasets and models. In collaboration with Ant Group, a leading financial technology company, we have implemented and deployed MalHug on a mirrored Hugging Face instance within their infrastructure, where it has been operational for over three months. During this period, MalHug has monitored more than 705K models and 176K datasets, uncovering 91 malicious models and 9 malicious dataset loading scripts. These findings reveal a range of security threats, including reverse shells, browser credential theft, and system reconnaissance. This work not only bridges a critical gap in understanding the security of the PTM supply chain but also provides a practical, industry-tested solution for enhancing the security of pre-trained model hubs.
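MalHug's internals are not detailed in the abstract; the following is a minimal stdlib-only sketch of the general idea behind scanning pickle-based model files. A pickle can smuggle a callable via `__reduce__` that executes on load, and its opcode stream can be inspected without ever unpickling it (the `SUSPICIOUS` denylist is an invented example, far smaller than a production one):

```python
import os
import pickle
import pickletools

# A pickle can embed a callable via __reduce__ that runs on *load*;
# here we only serialize and scan the bytes, never unpickle them.
class EvilModel:
    def __reduce__(self):
        return (os.system, ("echo pwned",))  # would execute on pickle.load

# (module, name) pairs treated as dangerous; os.system pickles as
# posix.system on Linux and nt.system on Windows.
SUSPICIOUS = {("os", "system"), ("posix", "system"), ("nt", "system"),
              ("builtins", "eval"), ("builtins", "exec"),
              ("subprocess", "Popen")}

def scan_pickle(data: bytes) -> bool:
    """Flag pickles whose GLOBAL/STACK_GLOBAL opcodes import dangerous names."""
    strings = []
    for opcode, arg, _pos in pickletools.genops(data):
        if opcode.name in ("SHORT_BINUNICODE", "BINUNICODE", "UNICODE"):
            strings.append(arg)            # STACK_GLOBAL reads its two operands
        elif opcode.name == "GLOBAL":      # arg is "module name"
            mod, _, name = arg.partition(" ")
            if (mod, name) in SUSPICIOUS:
                return True
        elif opcode.name == "STACK_GLOBAL" and len(strings) >= 2:
            if (strings[-2], strings[-1]) in SUSPICIOUS:
                return True
    return False

payload = pickle.dumps(EvilModel())
print(scan_pickle(payload))            # True: flags os.system
print(scan_pickle(pickle.dumps([1])))  # False: benign pickle
```

This is essentially what opcode-level scanners like Fickling do; the paper's point is that such syntactic scanning alone misses threats that require semantic-level (e.g., taint) analysis.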
Citations: 0
Generating API Parameter Security Rules with LLM for API Misuse Detection
Pub Date : 2024-09-14 DOI: arxiv-2409.09288
Jinghua Liu, Yi Yang, Kai Chen, Miaoqian Lin
In this paper, we present a new framework, named GPTAid, for automatically generating API parameter security rules (APSRs) by analyzing API source code with an LLM, and for detecting API misuse caused by incorrect parameter use. To validate the correctness of the LLM-generated APSRs, we propose an execution feedback-checking approach based on the observation that security-critical API misuse is often caused by APSR violations, most of which result in runtime errors. Specifically, GPTAid first uses the LLM to generate raw APSRs and the Right calling code, and then generates Violation code for each raw APSR by modifying the Right calling code using the LLM. Subsequently, GPTAid dynamically executes each piece of Violation code and filters out incorrect APSRs based on runtime errors. To further generate concrete APSRs, GPTAid employs code differential analysis to refine the filtered ones. In particular, as a programming language is more precise than natural language, GPTAid identifies the key operations within Violation code by differential analysis and then generates the corresponding concrete APSR based on those operations. These concrete APSRs can be precisely interpreted into applicable detection code, which proves effective in API misuse detection. Implemented on a dataset containing 200 randomly selected APIs from eight popular libraries, GPTAid achieves a precision of 92.3%. Moreover, it generates 6 times more APSRs than state-of-the-art detectors on a comparison dataset of previously reported bugs and APSRs. We further evaluated GPTAid on 47 applications; 210 unknown security bugs were found, potentially resulting in severe security issues (e.g., system crashes), 150 of which have been confirmed by developers after our reports.
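The execution-feedback idea can be sketched in a few lines: keep a candidate rule only if its Right code runs cleanly while its Violation code raises at runtime. The rules and snippets below are invented, and `eval` on string snippets merely stands in for GPTAid's dynamic execution:

```python
# Hypothetical candidate rules with "Right" and "Violation" calling code.
candidate_rules = [
    {
        "rule": "the count argument of bytes() must be non-negative",
        "right": "bytes(4)",
        "violation": "bytes(-1)",   # raises ValueError -> rule confirmed
    },
    {
        "rule": "the argument of round() must be even",  # bogus rule
        "right": "round(2.5)",
        "violation": "round(3.5)",  # runs fine -> rule discarded
    },
]

def filter_rules(rules):
    """Keep rules whose Violation code fails while the Right code succeeds."""
    kept = []
    for r in rules:
        try:
            eval(r["right"])              # Right usage must succeed
        except Exception:
            continue                      # broken Right code: skip the rule
        try:
            eval(r["violation"])          # Violation must raise...
        except Exception:
            kept.append(r["rule"])        # ...to confirm the rule
    return kept

print(filter_rules(candidate_rules))
# -> ['the count argument of bytes() must be non-negative']
```

The real system additionally refines the surviving rules by differencing the Right and Violation code to isolate the key operations; that step is omitted here.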
Citations: 0
Agents in Software Engineering: Survey, Landscape, and Vision
Pub Date : 2024-09-13 DOI: arxiv-2409.09030
Yanxian Huang, Wanjun Zhong, Ensheng Shi, Min Yang, Jiachi Chen, Hui Li, Yuchi Ma, Qianxiang Wang, Zibin Zheng, Yanlin Wang
In recent years, Large Language Models (LLMs) have achieved remarkable success and have been widely used in various downstream tasks, especially in the software engineering (SE) field. We find that many studies combining LLMs with SE have employed the concept of agents either explicitly or implicitly. However, there is a lack of an in-depth survey that sorts out the development context of existing works, analyzes how existing works combine LLM-based agent technologies to optimize various tasks, and clarifies the framework of LLM-based agents in SE. In this paper, we conduct the first survey of studies combining LLM-based agents with SE and present a framework of LLM-based agents in SE comprising three key modules: perception, memory, and action. We also summarize the current challenges in combining the two fields and propose future opportunities in response to existing challenges. We maintain a GitHub repository of the related papers at: https://github.com/DeepSoftwareAnalytics/Awesome-Agent4SE.
Citations: 0
An Empirical Analysis of Git Commit Logs for Potential Inconsistency in Code Clones
Pub Date : 2024-09-13 DOI: arxiv-2409.08555
Reishi Yokomori, Katsuro Inoue
Code clones are code snippets that are identical or similar to other snippets within the same or different files. They are often created through copy-and-paste practices and modified during development and maintenance activities. Since a pair of code clones, known as a clone pair, may be logically coupled, changes to each snippet are expected to be made simultaneously (co-changed) and consistently. There is extensive research on code clones, including studies related to the co-change of clones; however, detailed analysis of commit logs for code clone pairs has been limited. In this paper, we investigate the commit logs of code snippets from clone pairs, using the git-log command to extract changes to cloned code snippets. We analyzed 45 repositories owned by the Apache Software Foundation on GitHub and addressed three research questions regarding commit frequency, co-change ratio, and commit patterns. Our findings indicate that (1) on average, clone snippets are changed infrequently, typically only two or three times throughout their lifetime, (2) the ratio of co-changes is about half of all clone changes, with 10-20% of co-changed commits being concerning (potentially inconsistent), and (3) 35-65% of all clone pairs are classified as concerning clone pairs (potentially inconsistent clone pairs). These results suggest the need for a consistent management system throughout the commit timeline of clones.
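The co-change ratio in question can be sketched as set arithmetic over the commits touching each snippet. The commit IDs below are hypothetical, and the exact definition here (co-changed commits over all commits touching either snippet) is one plausible reading of the paper's metric:

```python
# Commits touching each clone snippet, e.g. as extracted per line range
# with `git log -L<start>,<end>:<file>`.
commits_a = {"c1", "c4", "c7"}   # commits changing clone snippet A
commits_b = {"c1", "c7", "c9"}   # commits changing clone snippet B

def co_change_ratio(a, b):
    """Fraction of all changes to either snippet that changed both."""
    changed = a | b         # every commit touching A or B
    co_changed = a & b      # commits touching both (the co-changes)
    return len(co_changed) / len(changed)

print(co_change_ratio(commits_a, commits_b))  # -> 0.5
```

Commits that appear in only one of the two sets are the candidates for the "concerning" (potentially inconsistent) category the paper flags.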
Citations: 0
Learning Graph-based Patch Representations for Identifying and Assessing Silent Vulnerability Fixes
Pub Date : 2024-09-13 DOI: arxiv-2409.08512
Mei Han, Lulu Wang, Jianming Chang, Bixin Li, Chunguang Zhang
Software projects depend on many third-party libraries, so high-risk vulnerabilities can propagate through the dependency chain to downstream projects. Owing to the subjective nature of patch management, software vendors commonly fix vulnerabilities silently. Silent vulnerability fixes leave downstream software unaware of urgent security issues in a timely manner, posing a security risk. Presently, most existing works on vulnerability fix identification consider the changed code only as a sequential textual sequence, ignoring the structural information of the code. In this paper, we propose GRAPE, a GRAph-based Patch rEpresentation that aims to 1) provide a unified framework for obtaining vulnerability fix patch representations; and 2) enhance the understanding of the intent and potential impact of patches by extracting structural information from the code. GRAPE employs a novel joint graph structure (MCPG) to represent the syntactic and semantic information of fix patches and embeds both nodes and edges. Subsequently, a carefully designed graph convolutional neural network (NE-GCN) is utilized to fully learn structural features by leveraging the attributes of the nodes and edges. Moreover, we construct a dataset containing 2251 silent fixes. For the experimental section, we evaluated patch representations on three tasks: vulnerability fix identification, vulnerability type classification, and vulnerability severity classification. Experimental results indicate that, in comparison to baseline methods, GRAPE more effectively reduces false positives and omissions in vulnerability fix identification and provides accurate vulnerability assessments.
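As a generic illustration of learning from graph structure rather than token sequences (this is plain mean-aggregation message passing over a toy "patch graph", not GRAPE's actual MCPG or NE-GCN; the node features and edges are invented):

```python
# Toy patch graph: nodes are changed statements, edges are invented
# syntactic/data-flow relations between them.
nodes = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}  # node feature vectors
edges = [(0, 1), (1, 2)]                               # undirected edges

def propagate(feats, edges):
    """One round: each node's features become the mean over itself
    and its neighbours, mixing in structural context."""
    neigh = {n: [n] for n in feats}          # include a self-loop
    for u, v in edges:
        neigh[u].append(v)
        neigh[v].append(u)
    out = {}
    for n, ns in neigh.items():
        dim = len(feats[n])
        out[n] = [sum(feats[m][d] for m in ns) / len(ns) for d in range(dim)]
    return out

print(propagate(nodes, edges))
# node 0 becomes [0.5, 0.5]: the mean of its own and node 1's features
```

A real GCN layer additionally applies a learned weight matrix and nonlinearity after aggregation, and NE-GCN also embeds edge attributes; the point here is only that representations flow along graph structure that a flat token sequence discards.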
Citations: 0
Are Existing Road Design Guidelines Suitable for Autonomous Vehicles?
Pub Date : 2024-09-13 DOI: arxiv-2409.10562
Yang Sun, Christopher M. Poskitt, Jun Sun
The emergence of Autonomous Vehicles (AVs) has spurred research into testing the resilience of their perception systems, i.e. ensuring they are not susceptible to making critical misjudgements. It is important that they are tested not only with respect to other vehicles on the road, but also the objects placed on the roadside. Trash bins, billboards, and greenery are all examples of such objects, typically placed according to guidelines that were developed for the human visual system and may not align perfectly with the needs of AVs. Existing tests, however, usually focus on adversarial objects with conspicuous shapes/patches, which are ultimately unrealistic given their unnatural appearances and the need for white-box knowledge. In this work, we introduce a black-box attack on the perception systems of AVs, in which the objective is to create realistic adversarial scenarios (i.e. satisfying road design guidelines) by manipulating the positions of common roadside objects, without resorting to `unnatural' adversarial patches. In particular, we propose TrashFuzz, a fuzzing algorithm to find scenarios in which the placement of these objects leads to substantial misperceptions by the AV -- such as mistaking a traffic light's colour -- with the overall goal of causing it to violate traffic laws. To ensure the realism of these scenarios, they must satisfy several rules encoding regulatory guidelines about the placement of objects on public streets. We implemented and evaluated these attacks for the Apollo, finding that TrashFuzz induced it into violating 15 out of 24 different traffic laws.
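A hedged sketch of guideline-constrained fuzzing: sample object placements only from within the guideline bounds, and keep those that a perception oracle misjudges. The clearance bounds and the "blind band" oracle below are invented stand-ins for the paper's regulatory rules and the AV's actual perception stack:

```python
import random

GUIDELINE_MIN_CLEARANCE = 0.5   # metres from the kerb (assumed rule)
GUIDELINE_MAX_CLEARANCE = 3.0   # metres from the kerb (assumed rule)

def perception_misjudges(x):
    """Stand-in for the AV's perception stack: a fixed 'blind' band
    of placements it misclassifies."""
    return 1.2 <= x <= 1.4

def fuzz_placements(trials=10_000, seed=42):
    """Random search over guideline-compliant placements; collect failures."""
    rng = random.Random(seed)
    failures = []
    for _ in range(trials):
        x = rng.uniform(GUIDELINE_MIN_CLEARANCE, GUIDELINE_MAX_CLEARANCE)
        if perception_misjudges(x):        # compliant, yet fools perception
            failures.append(round(x, 2))
    return failures

found = fuzz_placements()
print(len(found), found[:3])  # every failure is a realistic scenario
```

The key property mirrored from the paper is that every reported failure already satisfies the placement guidelines, so no "unnatural" adversarial artefact is needed; TrashFuzz itself searches far richer scenario spaces than this one-dimensional toy.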
{"title":"Are Existing Road Design Guidelines Suitable for Autonomous Vehicles?","authors":"Yang Sun, Christopher M. Poskitt, Jun Sun","doi":"arxiv-2409.10562","DOIUrl":"https://doi.org/arxiv-2409.10562","url":null,"abstract":"The emergence of Autonomous Vehicles (AVs) has spurred research into testing\u0000the resilience of their perception systems, i.e. to ensure they are not\u0000susceptible to making critical misjudgements. It is important that they are\u0000tested not only with respect to other vehicles on the road, but also those\u0000objects placed on the roadside. Trash bins, billboards, and greenery are all\u0000examples of such objects, typically placed according to guidelines that were\u0000developed for the human visual system, and which may not align perfectly with\u0000the needs of AVs. Existing tests, however, usually focus on adversarial objects\u0000with conspicuous shapes/patches, that are ultimately unrealistic given their\u0000unnatural appearances and the need for white box knowledge. In this work, we\u0000introduce a black box attack on the perception systems of AVs, in which the\u0000objective is to create realistic adversarial scenarios (i.e. satisfying road\u0000design guidelines) by manipulating the positions of common roadside objects,\u0000and without resorting to `unnatural' adversarial patches. In particular, we\u0000propose TrashFuzz , a fuzzing algorithm to find scenarios in which the\u0000placement of these objects leads to substantial misperceptions by the AV --\u0000such as mistaking a traffic light's colour -- with overall the goal of causing\u0000it to violate traffic laws. To ensure the realism of these scenarios, they must\u0000satisfy several rules encoding regulatory guidelines about the placement of\u0000objects on public streets. 
We implemented and evaluated these attacks for the\u0000Apollo, finding that TrashFuzz induced it into violating 15 out of 24 different\u0000traffic laws.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261340","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
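The search loop behind this style of position fuzzing can be sketched as a constrained random search: mutate object positions, reject placements that break the regulatory rule, and keep mutations that increase a misperception score. The road model, setback guideline, and scoring oracle below are hypothetical stand-ins, not TrashFuzz internals (a real run would render the scene and query the AV's perception stack):

```python
import random

ROAD_EDGE = 0.0
MIN_SETBACK = 1.5  # hypothetical regulatory setback from the road edge, in metres

def valid(positions):
    """Placement rule: every roadside object keeps the minimum setback."""
    return all(x >= ROAD_EDGE + MIN_SETBACK for x in positions)

def misperception_score(positions):
    """Stand-in oracle: pretend misperception peaks when objects sit at x == 2.0."""
    return sum(1.0 / (1.0 + abs(x - 2.0)) for x in positions)

def fuzz(n_objects=3, iterations=500, seed=0):
    """Hill-climbing fuzz loop: mutate positions, keep valid improvements."""
    rng = random.Random(seed)
    best = [ROAD_EDGE + MIN_SETBACK + rng.random() * 5 for _ in range(n_objects)]
    for _ in range(iterations):
        candidate = [x + rng.gauss(0, 0.5) for x in best]
        if valid(candidate) and misperception_score(candidate) > misperception_score(best):
            best = candidate
    return best

print([round(x, 2) for x in fuzz()])
```

Because invalid candidates are discarded before scoring, every scenario the loop ever reports satisfies the placement rule by construction, which is what keeps the generated scenarios "realistic" in the paper's sense.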
引用次数: 0
Diagnosis via Proofs of Unsatisfiability for First-Order Logic with Relational Objects 通过关系对象一阶逻辑的不可满足性证明进行诊断
Pub Date : 2024-09-13 DOI: arxiv-2409.09223
Nick Feng, Lina Marsso, Marsha Chechik
Satisfiability-based automated reasoning is an approach that is being successfully used in software engineering to validate complex software, including for safety-critical systems. Such reasoning underlies many validation activities, from requirements analysis to design consistency to test coverage. While generally effective, the back-end constraint solvers are often complex and inevitably error-prone, which threatens the soundness of their application. Thus, such solvers need to be validated, which includes checking correctness and explaining (un)satisfiability results returned by them. In this work, we consider satisfiability analysis based on First-Order Logic with relational objects (FOL*) which has been shown to be effective for reasoning about time- and data-sensitive early system designs. We tackle the challenge of validating the correctness of FOL* unsatisfiability results and deriving diagnoses to explain the causes of the unsatisfiability. Inspired by the concept of proofs of UNSAT from SAT/SMT solvers, we define a proof format and proof rules to track the solvers' reasoning steps as sequences of derivations towards UNSAT. We also propose an algorithm to verify the correctness of FOL* proofs while filtering unnecessary derivations and develop a proof-based diagnosis to explain the cause of unsatisfiability. We implemented the proposed proof support on top of the state-of-the-art FOL* satisfiability checker to generate proofs of UNSAT and validated our approach by applying the proof-based diagnoses to explain the causes of well-formedness issues of normative requirements of software systems.
基于可满足性的自动推理是一种在软件工程中被成功用于验证复杂软件(包括安全关键型系统)的方法。这种推理是许多验证活动的基础,从需求分析到设计一致性,再到测试覆盖范围。虽然总体上是有效的,但后端约束求解器往往很复杂,而且不可避免地容易出错,这威胁到其应用的合理性。因此,需要对这类求解器进行验证,包括检查其正确性并解释其返回的(不)可满足性结果。在这项工作中,我们考虑了基于关系对象的一阶逻辑(FOL*)的可满足性分析,该方法已被证明对推理时间和数据敏感的早期系统设计非常有效。我们面临的挑战是验证 FOL* 不可满足性结果的正确性,并得出诊断结果以解释不可满足性的原因。受 SAT/SMT 求解器的 UNSAT 证明概念的启发,我们定义了一种证明格式和证明规则,以跟踪求解器的推理步骤,将其作为实现 UNSAT 的推导序列。我们还提出了一种算法,用于验证 FOL* 证明的正确性,同时过滤不必要的推导,并开发了一种基于证明的诊断方法来解释不可满足性的原因。我们在最先进的FOL*可满足性检查器的基础上实现了所提出的证明支持,生成了UNSAT的证明,并通过应用基于证明的诊断来解释软件系统规范要求的良好形成性问题的原因,从而验证了我们的方法。
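The diagnosis idea — shrinking an unsatisfiable constraint set down to a small core that explains the conflict — can be illustrated in miniature without an SMT solver. The constraints, finite domain, and deletion-based shrinking below are a toy sketch, not the paper's FOL* proof format or verification algorithm:

```python
from itertools import product

# Toy unsatisfiable constraint set over integers x, y; names and formulas
# are illustrative only. c1 and c3 conflict with each other.
constraints = {
    "c1: x > 2":      lambda v: v["x"] > 2,
    "c2: y == x + 1": lambda v: v["y"] == v["x"] + 1,
    "c3: x < 1":      lambda v: v["x"] < 1,
}

def satisfiable(preds, domain=range(-5, 6)):
    """Brute-force satisfiability check over a small finite domain."""
    preds = list(preds)
    return any(all(p({"x": x, "y": y}) for p in preds)
               for x, y in product(domain, repeat=2))

def minimal_unsat_core(cs):
    """Deletion-based shrinking: drop any constraint whose removal keeps
    the rest unsatisfiable; what survives explains the conflict."""
    core = dict(cs)
    for name in list(core):
        trial = {k: p for k, p in core.items() if k != name}
        if not satisfiable(trial.values()):
            core = trial
    return sorted(core)

print(minimal_unsat_core(constraints))  # → ['c1: x > 2', 'c3: x < 1']
```

Production SAT/SMT solvers go further than this: rather than re-checking subsets, they emit a checkable derivation (a proof of UNSAT) whose steps can be independently verified — which is the gap the paper fills for FOL*.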
引用次数: 0
Towards Modified Condition/Decision Coverage of Rust 迈向 Rust 的修正条件/判定覆盖(MC/DC)
Pub Date : 2024-09-13 DOI: arxiv-2409.08708
Wanja Zaeske, Pietro Albini, Florian Gilcher, Umut Durak
Testing is an essential tool to assure software, especially so in safety-critical applications. To quantify how thoroughly a software item has been tested, a test coverage metric is required. Maybe the strictest such metric known in the safety critical systems is Modified Condition/Decision Coverage (MC/DC), which DO-178C prescribes for the highest software assurance level in aviation. In the past, ambiguities in the interpretation of MC/DC have been resolved already, i.e. in CAST-10. However, some central features of the Rust programming language necessitate further clarification. This work investigates aforementioned features, in particular pattern matching, providing a consistent view on how to apply MC/DC to Rust. Hence, this paper informs the implementation of Rust MC/DC tools, paving the road towards Rust in high-assurance applications.
测试是保证软件质量的重要工具,在安全关键型应用中尤其如此。为了量化软件项目的测试彻底程度,需要一个测试覆盖率指标。在安全关键型系统中,最严格的此类指标可能是修正条件/决策覆盖率(MC/DC),DO-178C 规定了航空领域最高的软件保证级别。过去,MC/DC 解释中的模糊之处已经在 CAST-10 中得到解决。然而,Rust 编程语言的一些核心特征需要进一步澄清。本文对上述特征,尤其是模式匹配进行了研究,为如何将 MC/DC 应用于 Rust 提供了一致的观点。因此,本文为 Rust MC/DC 工具的实施提供了参考,为 Rust 融入高保证应用铺平了道路。
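What MC/DC demands can be made concrete with a small script: for each condition in a decision, the test suite must contain a pair of tests that differ only in that condition and flip the decision's outcome, showing the condition independently affects the result. The decision `a && (b || c)` below is a hypothetical example, not taken from the paper:

```python
from itertools import product

def decision(a, b, c):
    """Hypothetical guard with three conditions: a && (b || c)."""
    return a and (b or c)

def mcdc_pairs(fn, n=3):
    """For each condition index, collect 'independence pairs': test vectors
    that differ only in that condition and flip the decision outcome."""
    pairs = {i: [] for i in range(n)}
    for v in product([False, True], repeat=n):
        for i in range(n):
            w = list(v)
            w[i] = not w[i]
            w = tuple(w)
            # record each unordered pair once, from its False side
            if not v[i] and fn(*v) != fn(*w):
                pairs[i].append((v, w))
    return pairs

for i, ps in mcdc_pairs(decision).items():
    print(f"condition {i}: {len(ps)} independence pair(s)")
```

A suite achieves MC/DC on this decision once it covers at least one such pair per condition; for n conditions this can be done with as few as n + 1 test vectors, far fewer than the 2^n of exhaustive condition coverage.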
引用次数: 0
B4: Towards Optimal Assessment of Plausible Code Solutions with Plausible Tests B4:通过可信测试实现对可信代码解决方案的最佳评估
Pub Date : 2024-09-13 DOI: arxiv-2409.08692
Mouxiang Chen, Zhongxin Liu, He Tao, Yusu Hong, David Lo, Xin Xia, Jianling Sun
Selecting the best code solution from multiple generated ones is an essential task in code generation, which can be achieved by using some reliable validators (e.g., developer-written test cases) for assistance. Since reliable test cases are not always available and can be expensive to build in practice, researchers propose to automatically generate test cases to assess code solutions. However, when both code solutions and test cases are plausible and not reliable, selecting the best solution becomes challenging. Although some heuristic strategies have been proposed to tackle this problem, they lack a strong theoretical guarantee and it is still an open question whether an optimal selection strategy exists. Our work contributes in two ways. First, we show that within a Bayesian framework, the optimal selection strategy can be defined based on the posterior probability of the observed passing states between solutions and tests. The problem of identifying the best solution is then framed as an integer programming problem. Second, we propose an efficient approach for approximating this optimal (yet uncomputable) strategy, where the approximation error is bounded by the correctness of prior knowledge. We then incorporate effective prior knowledge to tailor code generation tasks. Both theoretical and empirical studies confirm that existing heuristics are limited in selecting the best solutions with plausible test cases. Our proposed approximated optimal strategy B4 significantly surpasses existing heuristics in selecting code solutions generated by large language models (LLMs) with LLM-generated tests, achieving a relative performance improvement by up to 50% over the strongest heuristic and 246% over the random selection in the most challenging scenarios. Our code is publicly available at https://github.com/ZJU-CTAG/B4.
在代码生成过程中,从多个生成的代码方案中选择最佳代码方案是一项基本任务,这可以通过使用一些可靠的验证器(如开发人员编写的测试用例)来实现。由于可靠的测试用例并不总是可用,而且在实践中构建成本可能很高,因此研究人员建议自动生成测试用例来评估代码解决方案。然而,当代码解决方案和测试用例都似是而非且不可靠时,选择最佳解决方案就变得非常具有挑战性。虽然已经提出了一些启发式策略来解决这个问题,但它们缺乏有力的理论保证,而且是否存在最佳选择策略仍是一个未决问题。我们的工作在两个方面做出了贡献。首先,我们展示了在贝叶斯框架内,可以根据观察到的解决方案和测试之间的传递状态的后验概率来定义最佳选择策略。这样,确定最佳解决方案的问题就被归结为一个整数编程问题。其次,我们提出了近似这一最优(但不可计算)策略的有效方法,近似误差受先验知识正确性的限制。然后,我们结合有效的先验知识来定制代码生成任务。理论和实证研究都证实,现有的启发式方法在选择具有可信测试用例的最佳解决方案方面存在局限性。我们提出的近似最优策略 B4 在选择由大语言模型(LLM)生成的代码解决方案时大大超越了现有的启发式方法,在最具挑战性的场景中,比最强启发式方法的相对性能提高了 50%,比随机选择方法的性能提高了 246%。我们的代码可在https://github.com/ZJU-CTAG/B4。
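The intuition behind posterior-based selection can be sketched with a naive Bayes-style score: assume each plausible test is independently correct with some prior probability, and rank candidate solutions by the likelihood of their observed pass/fail pattern. The pass matrix and prior below are illustrative, and this simplification is not the B4 algorithm itself:

```python
# Toy pass matrix: rows are candidate solutions, columns are plausible
# (possibly wrong) generated tests; 1 means the solution passes the test.
pass_matrix = [
    [1, 1, 0, 1],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
]

def select_solution(matrix, prior_test_correct=0.8):
    """Naive Bayes-style ranking: if a solution is correct, it passes a test
    exactly when that test is correct, which happens with the prior probability.
    Score each solution by the likelihood of its observed pass/fail row."""
    p = prior_test_correct
    scores = []
    for row in matrix:
        likelihood = 1.0
        for passed in row:
            likelihood *= p if passed else (1.0 - p)
        scores.append(likelihood)
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

best, scores = select_solution(pass_matrix)
print(best, [round(s, 4) for s in scores])  # → 1 [0.1024, 0.4096, 0.0064]
```

Note how this already beats the naive "most tests passed" heuristic when tests disagree: the prior lets a principled score discount failures on likely-wrong tests, which is the kind of effect the paper formalizes and optimizes.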
引用次数: 0