Wenhan Wang, Kaibo Liu, An Ran Chen, Ge Li, Zhi Jin, Gang Huang, Lei Ma
Symbolic execution is a key technique in software testing: it generates test cases by collecting symbolic path constraints and then solving those constraints with SMT solvers. Symbolic execution has proven helpful in generating high-coverage test cases, but its limitations, e.g., the difficulty of solving path constraints, prevent broader use in software testing. Moreover, symbolic execution encounters many difficulties when applied to dynamically typed languages like Python, because it is extremely challenging to translate the flexible Python grammar into the rigid input languages of solvers. To overcome the main challenges of applying symbolic execution to Python, we propose an LLM-empowered agent, LLM-Sym, that automatically calls an SMT solver, Z3, to solve execution path constraints. Starting from an introductory-level symbolic execution engine, our LLM agent extends it to support programs with the complex data type 'list'. The core contribution of LLM-Sym is translating complex Python path constraints into Z3 code. To enable accurate path-to-Z3 translation, we design a multi-step code generation pipeline that includes type inference, retrieval, and self-refinement. Our experiments demonstrate that LLM-Sym can solve path constraints on LeetCode problems with complicated control flow and list data structures, which is impossible for the backbone symbolic execution engine. Our approach paves the way for combining the generation ability of LLMs with the reasoning ability of symbolic solvers, and opens up new opportunities in LLM-augmented test case generation.
{"title":"Python Symbolic Execution with LLM-powered Code Generation","authors":"Wenhan Wang, Kaibo Liu, An Ran Chen, Ge Li, Zhi Jin, Gang Huang, Lei Ma","doi":"arxiv-2409.09271","DOIUrl":"https://doi.org/arxiv-2409.09271","url":null,"abstract":"Symbolic execution is a key technology in software testing, which generates\u0000test cases by collecting symbolic path constraints and then solving constraints\u0000with SMT solvers. Symbolic execution has been proven helpful in generating\u0000high-coverage test cases, but its limitations, e.g., the difficulties in\u0000solving path constraints, prevent it from broader usage in software testing.\u0000Moreover, symbolic execution has encountered many difficulties when applied to\u0000dynamically typed languages like Python, because it is extremely challenging to\u0000translate the flexible Python grammar into rigid solvers. To overcome the main challenges of applying symbolic execution in Python, we\u0000proposed an LLM-empowered agent, LLM-Sym, that automatically calls an SMT\u0000solver, Z3, to solve execution path constraints. Based on an introductory-level\u0000symbolic execution engine, our LLM agent can extend it to supporting programs\u0000with complex data type `list'. The core contribution of LLM-Sym is translating\u0000complex Python path constraints into Z3 code. To enable accurate path-to-Z3\u0000translation, we design a multiple-step code generation pipeline including type\u0000inference, retrieval and self-refine. Our experiments demonstrate that LLM-Sym\u0000is capable of solving path constraints on Leetcode problems with complicated\u0000control flows and list data structures, which is impossible for the backbone\u0000symbolic execution engine. Our approach paves the way for the combination of\u0000the generation ability of LLMs with the reasoning ability of symbolic solvers,\u0000and opens up new opportunities in LLM-augmented test case generation.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261397","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The proliferation of pre-trained models (PTMs) and datasets has led to the emergence of centralized model hubs like Hugging Face, which facilitate collaborative development and reuse. However, recent security reports have uncovered vulnerabilities and instances of malicious attacks within these platforms, highlighting growing security concerns. This paper presents the first systematic study of malicious code poisoning attacks on pre-trained model hubs, focusing on the Hugging Face platform. We conduct a comprehensive threat analysis, develop a taxonomy of model formats, and perform root cause analysis of vulnerable formats. While existing tools like Fickling and ModelScan offer some protection, they face limitations in semantic-level analysis and comprehensive threat detection. To address these challenges, we propose MalHug, an end-to-end pipeline tailored for Hugging Face that combines dataset loading script extraction, model deserialization, in-depth taint analysis, and heuristic pattern matching to detect and classify malicious code poisoning attacks in datasets and models. In collaboration with Ant Group, a leading financial technology company, we have implemented and deployed MalHug on a mirrored Hugging Face instance within their infrastructure, where it has been operational for over three months. During this period, MalHug has monitored more than 705K models and 176K datasets, uncovering 91 malicious models and 9 malicious dataset loading scripts. These findings reveal a range of security threats, including reverse shell, browser credential theft, and system reconnaissance. This work not only bridges a critical gap in understanding the security of the PTM supply chain but also provides a practical, industry-tested solution for enhancing the security of pre-trained model hubs.
{"title":"Models Are Codes: Towards Measuring Malicious Code Poisoning Attacks on Pre-trained Model Hubs","authors":"Jian Zhao, Shenao Wang, Yanjie Zhao, Xinyi Hou, Kailong Wang, Peiming Gao, Yuanchao Zhang, Chen Wei, Haoyu Wang","doi":"arxiv-2409.09368","DOIUrl":"https://doi.org/arxiv-2409.09368","url":null,"abstract":"The proliferation of pre-trained models (PTMs) and datasets has led to the\u0000emergence of centralized model hubs like Hugging Face, which facilitate\u0000collaborative development and reuse. However, recent security reports have\u0000uncovered vulnerabilities and instances of malicious attacks within these\u0000platforms, highlighting growing security concerns. This paper presents the\u0000first systematic study of malicious code poisoning attacks on pre-trained model\u0000hubs, focusing on the Hugging Face platform. We conduct a comprehensive threat\u0000analysis, develop a taxonomy of model formats, and perform root cause analysis\u0000of vulnerable formats. While existing tools like Fickling and ModelScan offer\u0000some protection, they face limitations in semantic-level analysis and\u0000comprehensive threat detection. To address these challenges, we propose MalHug,\u0000an end-to-end pipeline tailored for Hugging Face that combines dataset loading\u0000script extraction, model deserialization, in-depth taint analysis, and\u0000heuristic pattern matching to detect and classify malicious code poisoning\u0000attacks in datasets and models. In collaboration with Ant Group, a leading\u0000financial technology company, we have implemented and deployed MalHug on a\u0000mirrored Hugging Face instance within their infrastructure, where it has been\u0000operational for over three months. During this period, MalHug has monitored\u0000more than 705K models and 176K datasets, uncovering 91 malicious models and 9\u0000malicious dataset loading scripts. These findings reveal a range of security\u0000threats, including reverse shell, browser credential theft, and system\u0000reconnaissance. This work not only bridges a critical gap in understanding the\u0000security of the PTM supply chain but also provides a practical, industry-tested\u0000solution for enhancing the security of pre-trained model hubs.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261402","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we present a new framework, named GPTAid, for automatically generating API parameter security rules (APSRs) by analyzing API source code with an LLM, and for detecting API misuse caused by incorrect parameter use. To validate the correctness of the LLM-generated APSRs, we propose an execution feedback-checking approach based on the observation that security-critical API misuse is often caused by APSR violations, most of which result in runtime errors. Specifically, GPTAid first uses the LLM to generate raw APSRs and the Right calling code, and then generates Violation code for each raw APSR by modifying the Right calling code using the LLM. Subsequently, GPTAid dynamically executes each piece of Violation code and filters out incorrect APSRs based on runtime errors. To further generate concrete APSRs, GPTAid employs code differential analysis to refine the filtered ones. In particular, because programming languages are more precise than natural language, GPTAid identifies the key operations within Violation code by differential analysis and then generates the corresponding concrete APSR based on those operations. These concrete APSRs can be precisely interpreted into applicable detection code, which has proven effective in API misuse detection. Evaluated on a dataset containing 200 randomly selected APIs from eight popular libraries, GPTAid achieves a precision of 92.3%. Moreover, it generates 6 times more APSRs than state-of-the-art detectors on a comparison dataset of previously reported bugs and APSRs. We further evaluated GPTAid on 47 applications; it found 210 previously unknown security bugs potentially resulting in severe security issues (e.g., system crashes), 150 of which have been confirmed by developers after our reports.
{"title":"Generating API Parameter Security Rules with LLM for API Misuse Detection","authors":"Jinghua Liu, Yi Yang, Kai Chen, Miaoqian Lin","doi":"arxiv-2409.09288","DOIUrl":"https://doi.org/arxiv-2409.09288","url":null,"abstract":"In this paper, we present a new framework, named GPTAid, for automatic APSRs\u0000generation by analyzing API source code with LLM and detecting API misuse\u0000caused by incorrect parameter use. To validate the correctness of the\u0000LLM-generated APSRs, we propose an execution feedback-checking approach based\u0000on the observation that security-critical API misuse is often caused by APSRs\u0000violations, and most of them result in runtime errors. Specifically, GPTAid\u0000first uses LLM to generate raw APSRs and the Right calling code, and then\u0000generates Violation code for each raw APSR by modifying the Right calling code\u0000using LLM. Subsequently, GPTAid performs dynamic execution on each piece of\u0000Violation code and further filters out the incorrect APSRs based on runtime\u0000errors. To further generate concrete APSRs, GPTAid employs a code differential\u0000analysis to refine the filtered ones. Particularly, as the programming language\u0000is more precise than natural language, GPTAid identifies the key operations\u0000within Violation code by differential analysis, and then generates the\u0000corresponding concrete APSR based on the aforementioned operations. These\u0000concrete APSRs could be precisely interpreted into applicable detection code,\u0000which proven to be effective in API misuse detection. Implementing on the\u0000dataset containing 200 randomly selected APIs from eight popular libraries,\u0000GPTAid achieves a precision of 92.3%. Moreover, it generates 6 times more APSRs\u0000than state-of-the-art detectors on a comparison dataset of previously reported\u0000bugs and APSRs. We further evaluated GPTAid on 47 applications, 210 unknown\u0000security bugs were found potentially resulting in severe security issues (e.g.,\u0000system crashes), 150 of which have been confirmed by developers after our\u0000reports.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261405","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yanxian Huang, Wanjun Zhong, Ensheng Shi, Min Yang, Jiachi Chen, Hui Li, Yuchi Ma, Qianxiang Wang, Zibin Zheng, Yanlin Wang
In recent years, Large Language Models (LLMs) have achieved remarkable success and have been widely used in various downstream tasks, especially in the software engineering (SE) field. We find that many studies combining LLMs with SE have employed the concept of agents, either explicitly or implicitly. However, there is a lack of an in-depth survey that sorts out the development context of existing works, analyzes how existing works combine LLM-based agent technologies to optimize various tasks, and clarifies the framework of LLM-based agents in SE. In this paper, we conduct the first survey of studies combining LLM-based agents with SE and present a framework of LLM-based agents in SE that includes three key modules: perception, memory, and action. We also summarize the current challenges in combining the two fields and propose future opportunities in response to existing challenges. We maintain a GitHub repository of the related papers at: https://github.com/DeepSoftwareAnalytics/Awesome-Agent4SE.
{"title":"Agents in Software Engineering: Survey, Landscape, and Vision","authors":"Yanxian Huang, Wanjun Zhong, Ensheng Shi, Min Yang, Jiachi Chen, Hui Li, Yuchi Ma, Qianxiang Wang, Zibin Zheng, Yanlin Wang","doi":"arxiv-2409.09030","DOIUrl":"https://doi.org/arxiv-2409.09030","url":null,"abstract":"In recent years, Large Language Models (LLMs) have achieved remarkable\u0000success and have been widely used in various downstream tasks, especially in\u0000the tasks of the software engineering (SE) field. We find that many studies\u0000combining LLMs with SE have employed the concept of agents either explicitly or\u0000implicitly. However, there is a lack of an in-depth survey to sort out the\u0000development context of existing works, analyze how existing works combine the\u0000LLM-based agent technologies to optimize various tasks, and clarify the\u0000framework of LLM-based agents in SE. In this paper, we conduct the first survey\u0000of the studies on combining LLM-based agents with SE and present a framework of\u0000LLM-based agents in SE which includes three key modules: perception, memory,\u0000and action. We also summarize the current challenges in combining the two\u0000fields and propose future opportunities in response to existing challenges. We\u0000maintain a GitHub repository of the related papers at:\u0000https://github.com/DeepSoftwareAnalytics/Awesome-Agent4SE.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261429","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Code clones are code snippets that are identical or similar to other snippets within the same or different files. They are often created through copy-and-paste practices and modified during development and maintenance activities. Since a pair of code clones, known as a clone pair, has a possible logical coupling between its members, changes to each snippet are expected to be made simultaneously (co-changed) and consistently. There is extensive research on code clones, including studies related to the co-change of clones; however, detailed analysis of commit logs for code clone pairs has been limited. In this paper, we investigate the commit logs of code snippets from clone pairs, using the git-log command to extract changes to cloned code snippets. We analyzed 45 repositories owned by the Apache Software Foundation on GitHub and addressed three research questions regarding commit frequency, co-change ratio, and commit patterns. Our findings indicate that (1) on average, clone snippets are changed infrequently, typically only two or three times throughout their lifetime, (2) about half of all clone changes are co-changes, with 10-20% of co-changed commits being concerning (potentially inconsistent), and (3) 35-65% of all clone pairs are classified as concerning clone pairs (potentially inconsistent clone pairs). These results suggest the need for a system that consistently manages clones across their commit timelines.
{"title":"An Empirical Analysis of Git Commit Logs for Potential Inconsistency in Code Clones","authors":"Reishi Yokomori, Katsuro Inoue","doi":"arxiv-2409.08555","DOIUrl":"https://doi.org/arxiv-2409.08555","url":null,"abstract":"Code clones are code snippets that are identical or similar to other snippets\u0000within the same or different files. They are often created through\u0000copy-and-paste practices and modified during development and maintenance\u0000activities. Since a pair of code clones, known as a clone pair, has a possible\u0000logical coupling between them, it is expected that changes to each snippet are\u0000made simultaneously (co-changed) and consistently. There is extensive research\u0000on code clones, including studies related to the co-change of clones; however,\u0000detailed analysis of commit logs for code clone pairs has been limited. In this paper, we investigate the commit logs of code snippets from clone\u0000pairs, using the git-log command to extract changes to cloned code snippets. We\u0000analyzed 45 repositories owned by the Apache Software Foundation on GitHub and\u0000addressed three research questions regarding commit frequency, co-change ratio,\u0000and commit patterns. Our findings indicate that (1) on average, clone snippets\u0000are changed infrequently, typically only two or three times throughout their\u0000lifetime, (2) the ratio of co-changes is about half of all clone changes, with\u000010-20% of co-changed commits being concerning (potentially inconsistent), and\u0000(3) 35-65% of all clone pairs being classified as concerning clone pairs\u0000(potentially inconsistent clone pairs). These results suggest the need for a\u0000consistent management system through the commit timeline of clones.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261430","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mei Han, Lulu Wang, Jianming Chang, Bixin Li, Chunguang Zhang
Software projects depend on many third-party libraries; therefore, high-risk vulnerabilities can propagate through the dependency chain to downstream projects. Owing to the subjective nature of patch management, software vendors commonly fix vulnerabilities silently. Silent vulnerability fixes prevent downstream software from becoming aware of urgent security issues in a timely manner, posing a security risk. Presently, most existing works on vulnerability fix identification only treat the changed code as a sequential textual sequence, ignoring the structural information of the code. In this paper, we propose GRAPE, a GRAph-based Patch rEpresentation that aims to 1) provide a unified framework for representing vulnerability fix patches; and 2) enhance the understanding of the intent and potential impact of patches by extracting structural information from the code. GRAPE employs a novel joint graph structure (MCPG) to represent the syntactic and semantic information of fix patches and embeds both nodes and edges. Subsequently, a carefully designed graph convolutional neural network (NE-GCN) is utilized to fully learn structural features by leveraging the attributes of the nodes and edges. Moreover, we construct a dataset containing 2251 silent fixes. In our experiments, we evaluated the patch representations on three tasks: vulnerability fix identification, vulnerability type classification, and vulnerability severity classification. Experimental results indicate that, in comparison to baseline methods, GRAPE more effectively reduces false positives and missed detections in vulnerability fix identification and provides accurate vulnerability assessments.
{"title":"Learning Graph-based Patch Representations for Identifying and Assessing Silent Vulnerability Fixes","authors":"Mei Han, Lulu Wang, Jianming Chang, Bixin Li, Chunguang Zhang","doi":"arxiv-2409.08512","DOIUrl":"https://doi.org/arxiv-2409.08512","url":null,"abstract":"Software projects are dependent on many third-party libraries, therefore\u0000high-risk vulnerabilities can propagate through the dependency chain to\u0000downstream projects. Owing to the subjective nature of patch management,\u0000software vendors commonly fix vulnerabilities silently. Silent vulnerability\u0000fixes cause downstream software to be unaware of urgent security issues in a\u0000timely manner, posing a security risk to the software. Presently, most of the\u0000existing works for vulnerability fix identification only consider the changed\u0000code as a sequential textual sequence, ignoring the structural information of\u0000the code. In this paper, we propose GRAPE, a GRAph-based Patch rEpresentation\u0000that aims to 1) provide a unified framework for getting vulnerability fix\u0000patches representation; and 2) enhance the understanding of the intent and\u0000potential impact of patches by extracting structural information of the code.\u0000GRAPE employs a novel joint graph structure (MCPG) to represent the syntactic\u0000and semantic information of fix patches and embeds both nodes and edges.\u0000Subsequently, a carefully designed graph convolutional neural network (NE-GCN)\u0000is utilized to fully learn structural features by leveraging the attributes of\u0000the nodes and edges. Moreover, we construct a dataset containing 2251 silent\u0000fixes. For the experimental section, we evaluated patch representation on three\u0000tasks, including vulnerability fix identification, vulnerability types\u0000classification, and vulnerability severity classification. Experimental results\u0000indicate that, in comparison to baseline methods, GRAPE can more effectively\u0000reduce false positives and omissions of vulnerability fixes identification and\u0000provide accurate vulnerability assessments.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261433","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The emergence of Autonomous Vehicles (AVs) has spurred research into testing the resilience of their perception systems, i.e., ensuring they are not susceptible to making critical misjudgements. It is important that they are tested not only with respect to other vehicles on the road, but also with respect to objects placed on the roadside. Trash bins, billboards, and greenery are all examples of such objects, typically placed according to guidelines that were developed for the human visual system and that may not align perfectly with the needs of AVs. Existing tests, however, usually focus on adversarial objects with conspicuous shapes or patches, which are ultimately unrealistic given their unnatural appearance and the need for white-box knowledge. In this work, we introduce a black-box attack on the perception systems of AVs, in which the objective is to create realistic adversarial scenarios (i.e., scenarios satisfying road design guidelines) by manipulating the positions of common roadside objects, without resorting to 'unnatural' adversarial patches. In particular, we propose TrashFuzz, a fuzzing algorithm to find scenarios in which the placement of these objects leads to substantial misperceptions by the AV -- such as mistaking a traffic light's colour -- with the overall goal of causing it to violate traffic laws. To ensure the realism of these scenarios, they must satisfy several rules encoding regulatory guidelines about the placement of objects on public streets. We implemented and evaluated these attacks for the Apollo autonomous driving platform, finding that TrashFuzz induced it to violate 15 out of 24 different traffic laws.
{"title":"Are Existing Road Design Guidelines Suitable for Autonomous Vehicles?","authors":"Yang Sun, Christopher M. Poskitt, Jun Sun","doi":"arxiv-2409.10562","DOIUrl":"https://doi.org/arxiv-2409.10562","url":null,"abstract":"The emergence of Autonomous Vehicles (AVs) has spurred research into testing\u0000the resilience of their perception systems, i.e. to ensure they are not\u0000susceptible to making critical misjudgements. It is important that they are\u0000tested not only with respect to other vehicles on the road, but also those\u0000objects placed on the roadside. Trash bins, billboards, and greenery are all\u0000examples of such objects, typically placed according to guidelines that were\u0000developed for the human visual system, and which may not align perfectly with\u0000the needs of AVs. Existing tests, however, usually focus on adversarial objects\u0000with conspicuous shapes/patches, that are ultimately unrealistic given their\u0000unnatural appearances and the need for white box knowledge. In this work, we\u0000introduce a black box attack on the perception systems of AVs, in which the\u0000objective is to create realistic adversarial scenarios (i.e. satisfying road\u0000design guidelines) by manipulating the positions of common roadside objects,\u0000and without resorting to `unnatural' adversarial patches. In particular, we\u0000propose TrashFuzz , a fuzzing algorithm to find scenarios in which the\u0000placement of these objects leads to substantial misperceptions by the AV --\u0000such as mistaking a traffic light's colour -- with overall the goal of causing\u0000it to violate traffic laws. To ensure the realism of these scenarios, they must\u0000satisfy several rules encoding regulatory guidelines about the placement of\u0000objects on public streets. We implemented and evaluated these attacks for the\u0000Apollo, finding that TrashFuzz induced it into violating 15 out of 24 different\u0000traffic laws.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261340","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Satisfiability-based automated reasoning is an approach that is being successfully used in software engineering to validate complex software, including for safety-critical systems. Such reasoning underlies many validation activities, from requirements analysis to design consistency to test coverage. While generally effective, the back-end constraint solvers are often complex and inevitably error-prone, which threatens the soundness of their application. Thus, such solvers need to be validated, which includes checking the correctness of, and explaining, the (un)satisfiability results they return. In this work, we consider satisfiability analysis based on First-Order Logic with relational objects (FOL*), which has been shown to be effective for reasoning about time- and data-sensitive early system designs. We tackle the challenge of validating the correctness of FOL* unsatisfiability results and deriving diagnoses to explain the causes of the unsatisfiability. Inspired by the concept of proofs of UNSAT from SAT/SMT solvers, we define a proof format and proof rules to track the solver's reasoning steps as sequences of derivations towards UNSAT. We also propose an algorithm to verify the correctness of FOL* proofs while filtering out unnecessary derivations, and we develop a proof-based diagnosis to explain the cause of unsatisfiability. We implemented the proposed proof support on top of the state-of-the-art FOL* satisfiability checker to generate proofs of UNSAT and validated our approach by applying the proof-based diagnoses to explain the causes of well-formedness issues in normative requirements of software systems.
{"title":"Diagnosis via Proofs of Unsatisfiability for First-Order Logic with Relational Objects","authors":"Nick Feng, Lina Marsso, Marsha Chechik","doi":"arxiv-2409.09223","DOIUrl":"https://doi.org/arxiv-2409.09223","url":null,"abstract":"Satisfiability-based automated reasoning is an approach that is being\u0000successfully used in software engineering to validate complex software,\u0000including for safety-critical systems. Such reasoning underlies many validation\u0000activities, from requirements analysis to design consistency to test coverage.\u0000While generally effective, the back-end constraint solvers are often complex\u0000and inevitably error-prone, which threatens the soundness of their application.\u0000Thus, such solvers need to be validated, which includes checking correctness\u0000and explaining (un)satisfiability results returned by them. In this work, we\u0000consider satisfiability analysis based on First-Order Logic with relational\u0000objects (FOL*) which has been shown to be effective for reasoning about time-\u0000and data-sensitive early system designs. We tackle the challenge of validating\u0000the correctness of FOL* unsatisfiability results and deriving diagnoses to\u0000explain the causes of the unsatisfiability. Inspired by the concept of proofs\u0000of UNSAT from SAT/SMT solvers, we define a proof format and proof rules to\u0000track the solvers' reasoning steps as sequences of derivations towards UNSAT.\u0000We also propose an algorithm to verify the correctness of FOL* proofs while\u0000filtering unnecessary derivations and develop a proof-based diagnosis to\u0000explain the cause of unsatisfiability. We implemented the proposed proof\u0000support on top of the state-of-the-art FOL* satisfiability checker to generate\u0000proofs of UNSAT and validated our approach by applying the proof-based\u0000diagnoses to explain the causes of well-formedness issues of normative\u0000requirements of software systems.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261400","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Wanja Zaeske, Pietro Albini, Florian Gilcher, Umut Durak
Testing is an essential tool for software assurance, especially in safety-critical applications. To quantify how thoroughly a software item has been tested, a test coverage metric is required. Perhaps the strictest such metric known in safety-critical systems is Modified Condition/Decision Coverage (MC/DC), which DO-178C prescribes for the highest software assurance level in aviation. Ambiguities in the interpretation of MC/DC have been resolved in the past, i.e., in CAST-10. However, some central features of the Rust programming language necessitate further clarification. This work investigates the aforementioned features, in particular pattern matching, providing a consistent view of how to apply MC/DC to Rust. Hence, this paper informs the implementation of Rust MC/DC tools, paving the way towards Rust in high-assurance applications.
{"title":"Towards Modified Condition/Decision Coverage of Rust","authors":"Wanja Zaeske, Pietro Albini, Florian Gilcher, Umut Durak","doi":"arxiv-2409.08708","DOIUrl":"https://doi.org/arxiv-2409.08708","url":null,"abstract":"Testing is an essential tool to assure software, especially so in\u0000safety-critical applications. To quantify how thoroughly a software item has\u0000been tested, a test coverage metric is required. Maybe the strictest such\u0000metric known in the safety critical systems is Modified Condition/Decision\u0000Coverage (MC/DC), which DO-178C prescribes for the highest software assurance\u0000level in aviation. In the past, ambiguities in the interpretation of MC/DC have\u0000been resolved already, i. e. in CAST-10. However, some central features of the\u0000Rust programming language necessitate further clarification. This work\u0000investigates aforementioned features, in particular pattern matching, providing\u0000a consistent view on how to apply MC/DC to Rust. Hence, this paper informs the\u0000implementation of Rust MC/DC tools, paving the road towards Rust in\u0000high-assurance applications.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142269660","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mouxiang Chen, Zhongxin Liu, He Tao, Yusu Hong, David Lo, Xin Xia, Jianling Sun
Selecting the best code solution from multiple generated ones is an essential task in code generation, which can be achieved by using some reliable validators (e.g., developer-written test cases) for assistance. Since reliable test cases are not always available and can be expensive to build in practice, researchers have proposed automatically generating test cases to assess code solutions. However, when both the code solutions and the test cases are plausible but not reliable, selecting the best solution becomes challenging. Although some heuristic strategies have been proposed to tackle this problem, they lack a strong theoretical guarantee, and it is still an open question whether an optimal selection strategy exists. Our work contributes in two ways. First, we show that within a Bayesian framework, the optimal selection strategy can be defined based on the posterior probability of the observed passing states between solutions and tests. The problem of identifying the best solution is then framed as an integer programming problem. Second, we propose an efficient approach for approximating this optimal (yet uncomputable) strategy, where the approximation error is bounded by the correctness of prior knowledge. We then incorporate effective prior knowledge tailored to code generation tasks. Both theoretical and empirical studies confirm that existing heuristics are limited in selecting the best solutions with plausible test cases. Our proposed approximated optimal strategy, B4, significantly surpasses existing heuristics in selecting code solutions generated by large language models (LLMs) with LLM-generated tests, achieving a relative performance improvement of up to 50% over the strongest heuristic and 246% over random selection in the most challenging scenarios. Our code is publicly available at https://github.com/ZJU-CTAG/B4.
{"title":"B4: Towards Optimal Assessment of Plausible Code Solutions with Plausible Tests","authors":"Mouxiang Chen, Zhongxin Liu, He Tao, Yusu Hong, David Lo, Xin Xia, Jianling Sun","doi":"arxiv-2409.08692","DOIUrl":"https://doi.org/arxiv-2409.08692","url":null,"abstract":"Selecting the best code solution from multiple generated ones is an essential\u0000task in code generation, which can be achieved by using some reliable\u0000validators (e.g., developer-written test cases) for assistance. Since reliable\u0000test cases are not always available and can be expensive to build in practice,\u0000researchers propose to automatically generate test cases to assess code\u0000solutions. However, when both code solutions and test cases are plausible and\u0000not reliable, selecting the best solution becomes challenging. Although some\u0000heuristic strategies have been proposed to tackle this problem, they lack a\u0000strong theoretical guarantee and it is still an open question whether an\u0000optimal selection strategy exists. Our work contributes in two ways. First, we\u0000show that within a Bayesian framework, the optimal selection strategy can be\u0000defined based on the posterior probability of the observed passing states\u0000between solutions and tests. The problem of identifying the best solution is\u0000then framed as an integer programming problem. Second, we propose an efficient\u0000approach for approximating this optimal (yet uncomputable) strategy, where the\u0000approximation error is bounded by the correctness of prior knowledge. We then\u0000incorporate effective prior knowledge to tailor code generation tasks. Both\u0000theoretical and empirical studies confirm that existing heuristics are limited\u0000in selecting the best solutions with plausible test cases. Our proposed\u0000approximated optimal strategy B4 significantly surpasses existing heuristics in\u0000selecting code solutions generated by large language models (LLMs) with\u0000LLM-generated tests, achieving a relative performance improvement by up to 50%\u0000over the strongest heuristic and 246% over the random selection in the most\u0000challenging scenarios. Our code is publicly available at\u0000https://github.com/ZJU-CTAG/B4.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261435","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}