There is strong motivation to translate C code into Rust code due to the continuing threat of memory safety vulnerabilities in existing C programs and the significant attention paid to Rust as an alternative to the C language. While large language models (LLMs) show promise for automating this translation by generating more natural and safer code than rule-based methods, previous studies have shown that LLM-generated Rust code often fails to compile, even for relatively small C programs, due to significant differences between the two languages and context window limitations. We propose an LLM-based translation scheme that improves the success rate of translating large-scale C code into compilable Rust code. Our approach involves three key techniques: (1) pre-processing the C code to better align its structure and expressions with Rust, (2) segmenting the code into optimally sized translation units to avoid exceeding the LLM's context window limits, and (3) iteratively compiling and repairing errors while maintaining consistency between translation units using context-supplementing prompts. Compilation success is an essential first step toward functional equivalence, as only compilable code can be further tested. In experiments with 20 benchmark C programs, including programs exceeding 4,000 lines of code, we successfully translated all programs into compilable Rust code without losing any corresponding parts of the original code.
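As a rough illustration of the kind of pipeline the abstract describes (per-unit translation followed by an iterative compile-and-repair loop with context-supplementing prompts), here is a minimal Python sketch. The `llm_translate` and `llm_repair` helpers are hypothetical placeholders, the `rustc` invocation is only one possible way to check compilability, and none of this is the authors' actual implementation.

```python
import subprocess
import tempfile
from pathlib import Path

MAX_REPAIR_ROUNDS = 5

def llm_translate(c_unit: str, context: str) -> str:
    """Placeholder: ask an LLM to translate one pre-processed C unit to Rust,
    supplying already-translated units as context."""
    raise NotImplementedError("wire up your LLM client here")

def llm_repair(rust_code: str, compiler_errors: str, context: str) -> str:
    """Placeholder: ask an LLM to fix the given compiler errors while keeping
    the translation consistent with the supplied context."""
    raise NotImplementedError("wire up your LLM client here")

def compile_rust(rust_code: str) -> str:
    """Compile with rustc and return the error output ('' means success)."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "unit.rs"
        src.write_text(rust_code)
        result = subprocess.run(
            ["rustc", "--edition", "2021", "--crate-type", "lib", str(src)],
            capture_output=True, text=True, cwd=tmp,
        )
        return result.stderr if result.returncode != 0 else ""

def translate_units(c_units: list[str]) -> list[str]:
    """Translate each C unit, then iteratively repair it until it compiles."""
    translated: list[str] = []
    for unit in c_units:
        # Context-supplementing prompt: include recently translated units so
        # names and types stay consistent across translation units.
        context = "\n".join(translated[-3:])
        rust = llm_translate(unit, context)
        for _ in range(MAX_REPAIR_ROUNDS):
            errors = compile_rust(rust)
            if not errors:
                break
            rust = llm_repair(rust, errors, context)
        translated.append(rust)
    return translated
```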
{"title":"Context-aware Code Segmentation for C-to-Rust Translation using Large Language Models","authors":"Momoko Shiraishi, Takahiro Shinagawa","doi":"arxiv-2409.10506","DOIUrl":"https://doi.org/arxiv-2409.10506","url":null,"abstract":"There is strong motivation to translate C code into Rust code due to the\u0000continuing threat of memory safety vulnerabilities in existing C programs and\u0000the significant attention paid to Rust as an alternative to the C language.\u0000While large language models (LLMs) show promise for automating this translation\u0000by generating more natural and safer code than rule-based methods, previous\u0000studies have shown that LLM-generated Rust code often fails to compile, even\u0000for relatively small C programs, due to significant differences between the two\u0000languages and context window limitations. We propose an LLM-based translation\u0000scheme that improves the success rate of translating large-scale C code into\u0000compilable Rust code. Our approach involves three key techniques: (1)\u0000pre-processing the C code to better align its structure and expressions with\u0000Rust, (2) segmenting the code into optimally sized translation units to avoid\u0000exceeding the LLM's context window limits, and (3) iteratively compiling and\u0000repairing errors while maintaining consistency between translation units using\u0000context-supplementing prompts. Compilation success is an essential first step\u0000in achieving functional equivalence, as only compilable code can be further\u0000tested. In experiments with 20 benchmark C programs, including those exceeding\u00004 kilo lines of code, we successfully translated all programs into compilable\u0000Rust code without losing corresponding parts of the original code.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261338","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mobina Shahbandeh, Parsa Alian, Noor Nashid, Ali Mesbah
End-to-end web testing is challenging due to the need to explore diverse web application functionalities. Current state-of-the-art methods, such as WebCanvas, are not designed for broad functionality exploration; they rely on specific, detailed task descriptions, limiting their adaptability in dynamic web environments. We introduce NaviQAte, which frames web application exploration as a question-and-answer task, generating action sequences for functionalities without requiring detailed parameters. Our three-phase approach utilizes advanced large language models like GPT-4o for complex decision-making and cost-effective models, such as GPT-4o mini, for simpler tasks. NaviQAte focuses on functionality-guided web application navigation, integrating multi-modal inputs such as text and images to enhance contextual understanding. Evaluations on the Mind2Web-Live and Mind2Web-Live-Abstracted datasets show that NaviQAte achieves a 44.23% success rate in user task navigation and a 38.46% success rate in functionality navigation, representing a 15% and 33% improvement over WebCanvas. These results underscore the effectiveness of our approach in advancing automated web application testing.
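A minimal sketch of the general idea of framing navigation as question answering and routing between a stronger and a cheaper model by step difficulty. The helper names, data structure, and routing rule below are assumptions for illustration, not NaviQAte's actual design.

```python
from dataclasses import dataclass

@dataclass
class PageObservation:
    url: str
    visible_text: str            # text extracted from the rendered page
    screenshot_path: str | None  # optional image input for multi-modal models

def call_model(model: str, question: str, observation: PageObservation) -> str:
    """Placeholder for a multi-modal LLM call (text plus optional screenshot)."""
    raise NotImplementedError("wire up your LLM client here")

def next_action(functionality: str, observation: PageObservation) -> str:
    """Frame the navigation step as a question and pick a model by difficulty."""
    question = (
        f"To exercise the functionality '{functionality}', what single action "
        f"(click/type/select) should be taken next on this page?"
    )
    # Hypothetical routing rule: short, text-only pages go to the cheaper model.
    simple_step = observation.screenshot_path is None and len(observation.visible_text) < 2000
    model = "gpt-4o-mini" if simple_step else "gpt-4o"
    return call_model(model, question, observation)
```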
{"title":"NaviQAte: Functionality-Guided Web Application Navigation","authors":"Mobina Shahbandeh, Parsa Alian, Noor Nashid, Ali Mesbah","doi":"arxiv-2409.10741","DOIUrl":"https://doi.org/arxiv-2409.10741","url":null,"abstract":"End-to-end web testing is challenging due to the need to explore diverse web\u0000application functionalities. Current state-of-the-art methods, such as\u0000WebCanvas, are not designed for broad functionality exploration; they rely on\u0000specific, detailed task descriptions, limiting their adaptability in dynamic\u0000web environments. We introduce NaviQAte, which frames web application\u0000exploration as a question-and-answer task, generating action sequences for\u0000functionalities without requiring detailed parameters. Our three-phase approach\u0000utilizes advanced large language models like GPT-4o for complex decision-making\u0000and cost-effective models, such as GPT-4o mini, for simpler tasks. NaviQAte\u0000focuses on functionality-guided web application navigation, integrating\u0000multi-modal inputs such as text and images to enhance contextual understanding.\u0000Evaluations on the Mind2Web-Live and Mind2Web-Live-Abstracted datasets show\u0000that NaviQAte achieves a 44.23% success rate in user task navigation and a\u000038.46% success rate in functionality navigation, representing a 15% and 33%\u0000improvement over WebCanvas. These results underscore the effectiveness of our\u0000approach in advancing automated web application testing.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261331","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ana Nunez, Nafis Tanveer Islam, Sumit Kumar Jha, Peyman Najafirad
Recent advancements in automatic code generation using large language models (LLMs) have brought us closer to fully automated secure software development. However, existing approaches often rely on a single agent for code generation, which struggles to produce secure, vulnerability-free code. Traditional program synthesis with LLMs has primarily focused on functional correctness, often neglecting critical security implications that arise at runtime. To address these challenges, we propose AutoSafeCoder, a multi-agent framework that leverages LLM-driven agents for code generation, vulnerability analysis, and security enhancement through continuous collaboration. The framework consists of three agents: a Coding Agent responsible for code generation, a Static Analyzer Agent that identifies vulnerabilities, and a Fuzzing Agent that performs dynamic testing with a mutation-based fuzzing approach to detect runtime errors. Our contribution is to ensure the safety of LLM-based code generation by integrating static and dynamic testing into an iterative generation process, which improves security. Experiments on the SecurityEval dataset demonstrate a 13% reduction in code vulnerabilities compared to baseline LLMs, with no compromise in functionality.
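A minimal sketch of the generate-analyze-fuzz-repair loop the abstract outlines. All three agent functions are placeholders, and the loop structure and iteration cap are assumptions rather than AutoSafeCoder's actual implementation.

```python
MAX_ITERATIONS = 4

def coding_agent(task: str, feedback: list[str]) -> str:
    """Placeholder: LLM generates (or revises) code for the task given feedback."""
    raise NotImplementedError

def static_analyzer_agent(code: str) -> list[str]:
    """Placeholder: report potential vulnerabilities (CWE-style findings)."""
    raise NotImplementedError

def fuzzing_agent(code: str) -> list[str]:
    """Placeholder: mutation-based fuzzing of the code's entry points; returns runtime errors."""
    raise NotImplementedError

def generate_secure_code(task: str) -> str:
    """Iterate code generation until neither static nor dynamic issues remain."""
    feedback: list[str] = []
    code = ""
    for _ in range(MAX_ITERATIONS):
        code = coding_agent(task, feedback)
        findings = static_analyzer_agent(code) + fuzzing_agent(code)
        if not findings:
            break  # no static or dynamic issues found
        feedback = findings  # feed the issues back into the next generation round
    return code
```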
{"title":"AutoSafeCoder: A Multi-Agent Framework for Securing LLM Code Generation through Static Analysis and Fuzz Testing","authors":"Ana Nunez, Nafis Tanveer Islam, Sumit Kumar Jha, Peyman Najafirad","doi":"arxiv-2409.10737","DOIUrl":"https://doi.org/arxiv-2409.10737","url":null,"abstract":"Recent advancements in automatic code generation using large language models\u0000(LLMs) have brought us closer to fully automated secure software development.\u0000However, existing approaches often rely on a single agent for code generation,\u0000which struggles to produce secure, vulnerability-free code. Traditional program\u0000synthesis with LLMs has primarily focused on functional correctness, often\u0000neglecting critical dynamic security implications that happen during runtime.\u0000To address these challenges, we propose AutoSafeCoder, a multi-agent framework\u0000that leverages LLM-driven agents for code generation, vulnerability analysis,\u0000and security enhancement through continuous collaboration. The framework\u0000consists of three agents: a Coding Agent responsible for code generation, a\u0000Static Analyzer Agent identifying vulnerabilities, and a Fuzzing Agent\u0000performing dynamic testing using a mutation-based fuzzing approach to detect\u0000runtime errors. Our contribution focuses on ensuring the safety of multi-agent\u0000code generation by integrating dynamic and static testing in an iterative\u0000process during code generation by LLM that improves security. Experiments using\u0000the SecurityEval dataset demonstrate a 13% reduction in code vulnerabilities\u0000compared to baseline LLMs, with no compromise in functionality.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261334","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
LLM agents enhanced by tree search algorithms have yielded notable performance in code generation. However, current search algorithms in this domain suffer from low search quality for several reasons: 1) ineffective design of the search space for the high-reasoning demands of code generation tasks, 2) inadequate integration of code feedback with the search algorithm, and 3) poor handling of negative feedback during the search, leading to reduced search efficiency and quality. To address these challenges, we propose to search over the reasoning process behind the code and to use detailed feedback from code execution to refine erroneous thoughts during the search. In this paper, we introduce RethinkMCTS, which employs the Monte Carlo Tree Search (MCTS) algorithm to conduct thought-level searches before generating code, thereby exploring a wider range of strategies. More importantly, we construct verbal feedback from fine-grained code execution feedback to refine erroneous thoughts during the search. This ensures that the search progresses along correct reasoning paths, thus improving the overall search quality of the tree by leveraging execution feedback. Through extensive experiments, we demonstrate that RethinkMCTS outperforms previous search-based and feedback-based code generation baselines. On the HumanEval dataset, it improves the pass@1 of GPT-3.5-turbo from 70.12 to 89.02 and of GPT-4o-mini from 87.20 to 94.51. It effectively conducts more thorough exploration through thought-level searches and enhances the search quality of the entire tree by incorporating a rethink operation.
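A minimal sketch of thought-level MCTS with a rethink step driven by execution feedback, assuming hypothetical `propose_thoughts`, `generate_and_execute`, and `rethink` LLM helpers. It follows the standard select/expand/evaluate/backpropagate pattern and is not the paper's actual algorithm.

```python
import math
import random

class ThoughtNode:
    """A node in the search tree holds one reasoning step ('thought')."""
    def __init__(self, thought: str, parent=None):
        self.thought = thought
        self.parent = parent
        self.children: list["ThoughtNode"] = []
        self.visits = 0
        self.value = 0.0

    def uct(self, c: float = 1.4) -> float:
        if self.visits == 0:
            return float("inf")
        return self.value / self.visits + c * math.sqrt(
            math.log(self.parent.visits) / self.visits
        )

def propose_thoughts(path: list[str]) -> list[str]:
    """Placeholder: LLM proposes candidate next reasoning steps."""
    raise NotImplementedError

def generate_and_execute(path: list[str]) -> tuple[float, str]:
    """Placeholder: generate code from the thought path, run it on tests,
    and return (reward in [0, 1], verbal execution feedback)."""
    raise NotImplementedError

def rethink(thought: str, feedback: str) -> str:
    """Placeholder: LLM rewrites an erroneous thought using execution feedback."""
    raise NotImplementedError

def search(root: ThoughtNode, iterations: int = 32) -> None:
    for _ in range(iterations):
        # Selection: descend by UCT until a leaf is reached.
        node, path = root, [root.thought]
        while node.children:
            node = max(node.children, key=ThoughtNode.uct)
            path.append(node.thought)
        # Expansion: add LLM-proposed thoughts as children of the leaf.
        for t in propose_thoughts(path):
            node.children.append(ThoughtNode(t, parent=node))
        child = random.choice(node.children)
        # Evaluation: turn the thought path into code and execute it.
        reward, feedback = generate_and_execute(path + [child.thought])
        if reward < 1.0:
            # Rethink: refine the erroneous thought instead of discarding the branch.
            child.thought = rethink(child.thought, feedback)
        # Backpropagation up to the root.
        while child is not None:
            child.visits += 1
            child.value += reward
            child = child.parent
```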
{"title":"RethinkMCTS: Refining Erroneous Thoughts in Monte Carlo Tree Search for Code Generation","authors":"Qingyao Li, Wei Xia, Kounianhua Du, Xinyi Dai, Ruiming Tang, Yasheng Wang, Yong Yu, Weinan Zhang","doi":"arxiv-2409.09584","DOIUrl":"https://doi.org/arxiv-2409.09584","url":null,"abstract":"LLM agents enhanced by tree search algorithms have yielded notable\u0000performances in code generation. However, current search algorithms in this\u0000domain suffer from low search quality due to several reasons: 1) Ineffective\u0000design of the search space for the high-reasoning demands of code generation\u0000tasks, 2) Inadequate integration of code feedback with the search algorithm,\u0000and 3) Poor handling of negative feedback during the search, leading to reduced\u0000search efficiency and quality. To address these challenges, we propose to\u0000search for the reasoning process of the code and use the detailed feedback of\u0000code execution to refine erroneous thoughts during the search. In this paper,\u0000we introduce RethinkMCTS, which employs the Monte Carlo Tree Search (MCTS)\u0000algorithm to conduct thought-level searches before generating code, thereby\u0000exploring a wider range of strategies. More importantly, we construct verbal\u0000feedback from fine-grained code execution feedback to refine erroneous thoughts\u0000during the search. This ensures that the search progresses along the correct\u0000reasoning paths, thus improving the overall search quality of the tree by\u0000leveraging execution feedback. Through extensive experiments, we demonstrate\u0000that RethinkMCTS outperforms previous search-based and feedback-based code\u0000generation baselines. On the HumanEval dataset, it improves the pass@1 of\u0000GPT-3.5-turbo from 70.12 to 89.02 and GPT-4o-mini from 87.20 to 94.51. It\u0000effectively conducts more thorough exploration through thought-level searches\u0000and enhances the search quality of the entire tree by incorporating rethink\u0000operation.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261395","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Smart contracts are susceptible to exploitation by attackers, especially when they contain real-world vulnerabilities. To mitigate this risk, developers often rely on third-party audit services to identify potential vulnerabilities before project deployment. Nevertheless, repairing the identified vulnerabilities is still complex and labor-intensive, particularly for developers lacking security expertise. Moreover, existing pattern-based repair tools mostly fail to address real-world vulnerabilities due to their lack of high-level semantic understanding. To fill this gap, we propose ContractTinker, a Large Language Model (LLM)-empowered tool for real-world vulnerability repair. The key insight is our adoption of the Chain-of-Thought approach to break down the entire generation task into sub-tasks. Additionally, to reduce hallucination, we integrate program static analysis to guide the LLM. We evaluate ContractTinker on 48 high-risk vulnerabilities. The experimental results show that among the patches generated by ContractTinker, 23 (48%) are valid patches that fix the vulnerabilities, while 10 (21%) require only minor modifications. A video of ContractTinker is available at https://youtu.be/HWFVi-YHcPE.
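A minimal sketch of the described pipeline: decompose the repair into chain-of-thought sub-tasks and ground each prompt in static-analysis facts to reduce hallucination. The sub-task wording and the `static_analysis` and `ask_llm` helpers are hypothetical.

```python
def static_analysis(contract_source: str, vulnerability_report: str) -> str:
    """Placeholder: extract the functions, state variables, and call paths
    relevant to the reported vulnerability (e.g., via a Slither-style analyzer)."""
    raise NotImplementedError

def ask_llm(prompt: str) -> str:
    """Placeholder LLM call."""
    raise NotImplementedError

def repair_contract(contract_source: str, vulnerability_report: str) -> str:
    """Chain-of-thought style pipeline: each sub-task is grounded in
    static-analysis facts to keep the LLM focused on the real code."""
    facts = static_analysis(contract_source, vulnerability_report)
    root_cause = ask_llm(
        f"Vulnerability report:\n{vulnerability_report}\n"
        f"Relevant program facts:\n{facts}\n"
        "Sub-task 1: explain the root cause of this vulnerability."
    )
    fix_plan = ask_llm(
        f"Root cause:\n{root_cause}\n"
        "Sub-task 2: propose a minimal fix plan that preserves the contract's intended behavior."
    )
    patch = ask_llm(
        f"Contract:\n{contract_source}\nFix plan:\n{fix_plan}\n"
        "Sub-task 3: output the patched Solidity code only."
    )
    return patch
```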
{"title":"ContractTinker: LLM-Empowered Vulnerability Repair for Real-World Smart Contracts","authors":"Che Wang, Jiashuo Zhang, Jianbo Gao, Libin Xia, Zhi Guan, Zhong Chen","doi":"arxiv-2409.09661","DOIUrl":"https://doi.org/arxiv-2409.09661","url":null,"abstract":"Smart contracts are susceptible to being exploited by attackers, especially\u0000when facing real-world vulnerabilities. To mitigate this risk, developers often\u0000rely on third-party audit services to identify potential vulnerabilities before\u0000project deployment. Nevertheless, repairing the identified vulnerabilities is\u0000still complex and labor-intensive, particularly for developers lacking security\u0000expertise. Moreover, existing pattern-based repair tools mostly fail to address\u0000real-world vulnerabilities due to their lack of high-level semantic\u0000understanding. To fill this gap, we propose ContractTinker, a Large Language\u0000Models (LLMs)-empowered tool for real-world vulnerability repair. The key\u0000insight is our adoption of the Chain-of-Thought approach to break down the\u0000entire generation task into sub-tasks. Additionally, to reduce hallucination,\u0000we integrate program static analysis to guide the LLM. We evaluate\u0000ContractTinker on 48 high-risk vulnerabilities. The experimental results show\u0000that among the patches generated by ContractTinker, 23 (48%) are valid patches\u0000that fix the vulnerabilities, while 10 (21%) require only minor modifications.\u0000A video of ContractTinker is available at https://youtu.be/HWFVi-YHcPE.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261393","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Accurate estimation of project costs and durations remains a pivotal challenge in software engineering, directly impacting budgeting and resource management. Traditional estimation techniques, although widely utilized, often fall short due to their complexity and the dynamic nature of software development projects. This study introduces an innovative approach using Large Language Models (LLMs) to enhance the accuracy and usability of project cost predictions. We explore the efficacy of LLMs against traditional methods and contemporary machine learning techniques, focusing on their potential to simplify the estimation process and provide higher accuracy. Our research is structured around critical inquiries: whether LLMs can outperform existing models and traditional estimation techniques, how easily they can be integrated into current practices, and why traditional methods still prevail in industry settings. By applying LLMs to a range of real-world datasets and comparing their performance to both state-of-the-art and conventional methods, this study aims to demonstrate that LLMs not only yield more accurate estimates but also offer a user-friendly alternative to complex predictive models, potentially transforming project management strategies within the software industry.
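Purely as an illustration of how such a comparison might be set up, the sketch below prompts an LLM for structured effort and duration estimates that could then be scored against actuals and traditional baselines. The prompt, field names, and example record are assumptions, not the study's protocol.

```python
import json

def ask_llm(prompt: str) -> str:
    """Placeholder LLM call returning a JSON string."""
    raise NotImplementedError

def estimate_effort(project: dict) -> dict:
    """Ask an LLM for effort/duration estimates from structured project features,
    so the output can be compared against actuals and traditional baselines
    (e.g., COCOMO-style models) using error metrics such as MAE or MMRE."""
    prompt = (
        "You are a software project estimator. Given the project description "
        "below, return JSON with fields 'effort_person_months' and 'duration_months'.\n"
        f"{json.dumps(project, indent=2)}"
    )
    return json.loads(ask_llm(prompt))

# Example (hypothetical) project record:
# estimate_effort({"team_size": 6, "kloc": 45, "domain": "fintech", "methodology": "agile"})
```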
{"title":"Leveraging Large Language Models for Predicting Cost and Duration in Software Engineering Projects","authors":"Justin Carpenter, Chia-Ying Wu, Nasir U. Eisty","doi":"arxiv-2409.09617","DOIUrl":"https://doi.org/arxiv-2409.09617","url":null,"abstract":"Accurate estimation of project costs and durations remains a pivotal\u0000challenge in software engineering, directly impacting budgeting and resource\u0000management. Traditional estimation techniques, although widely utilized, often\u0000fall short due to their complexity and the dynamic nature of software\u0000development projects. This study introduces an innovative approach using Large\u0000Language Models (LLMs) to enhance the accuracy and usability of project cost\u0000predictions. We explore the efficacy of LLMs against traditional methods and\u0000contemporary machine learning techniques, focusing on their potential to\u0000simplify the estimation process and provide higher accuracy. Our research is\u0000structured around critical inquiries into whether LLMs can outperform existing\u0000models, the ease of their integration into current practices, outperform\u0000traditional estimation, and why traditional methods still prevail in industry\u0000settings. By applying LLMs to a range of real-world datasets and comparing\u0000their performance to both state-of-the-art and conventional methods, this study\u0000aims to demonstrate that LLMs not only yield more accurate estimates but also\u0000offer a user-friendly alternative to complex predictive models, potentially\u0000transforming project management strategies within the software industry.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261394","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
C. B. Pronin, A. V. Volosova, A. V. Ostroukh, Yu. N. Strogov
This paper describes an approach to training and evaluating an adapter model for the popular language model "zephyr-7b-beta". The adapter was developed to improve the performance of the base model in tasks related to programming and understanding the Russian language. Given the high quality of the original model on English-language tasks, the goal of the research was to expand its linguistic and technical range. The proposed adapter was trained on a large and diverse dataset, including programming-related question-answer pairs as well as code-related texts in Russian. The applied training methodology improves the quality of the model's answers when understanding and generating Python code from Russian-language instructions. We evaluated the performance of the base model with the installed adapter using various metrics, comparing it to the base model as well as other state-of-the-art models in this field. The obtained results showed significant improvement, both in tasks related to writing Python code and in processing the Russian language, confirming the effectiveness of the proposed adapter.
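A minimal sketch of a QLoRA-style setup for this kind of adapter, using the Hugging Face `transformers` and `peft` libraries: the base model is loaded in 4-bit and small LoRA matrices are trained on top. The model id, target modules, and hyperparameters are illustrative assumptions and may differ from the paper's settings.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

BASE_MODEL = "HuggingFaceH4/zephyr-7b-beta"  # assumed Hugging Face id of the base model

# QLoRA: keep the base model frozen in 4-bit precision and train only LoRA adapters.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL, quantization_config=bnb_config, device_map="auto"
)

# Illustrative LoRA hyperparameters; the paper's actual settings may differ.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Fine-tuning on Russian programming Q&A pairs would then proceed with a standard
# training loop, e.g. transformers' Trainer or trl's SFTTrainer.
```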
{"title":"Overcoming linguistic barriers in code assistants: creating a QLoRA adapter to improve support for Russian-language code writing instructions","authors":"C. B. Pronin, A. V. Volosova, A. V. Ostroukh, Yu. N. Strogov","doi":"arxiv-2409.09353","DOIUrl":"https://doi.org/arxiv-2409.09353","url":null,"abstract":"In this paper, an approach to training and evaluating an adapter model for\u0000the popular language model \"zephyr-7b-beta\" is described. The adapter was\u0000developed to improve the performance of the base model in tasks related to\u0000programming and understanding the Russian language. Considering the high\u0000quality of the original model in tasks in the English language, the goal of the\u0000research was to expand its linguistic and technical spectrum. The proposed\u0000adapter was trained using a large and diverse dataset, including\u0000question-answer pairs related to programming, as well code-related texts in\u0000Russian language. The applied training methodology ensures an improvement in\u0000the model's quality of answers in understanding and generating Python code\u0000based on Russian instructions. We evaluated the performance of the base model\u0000with the installed adapter using various metrics, comparing it to the base\u0000model as well as other state-of-the-art models in this field. The obtained\u0000results showed significant improvement, both in tasks related to writing Python\u0000code and in processing the Russian language, confirming the effectiveness of\u0000the proposed adapter.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261398","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chenyang Yang, Yining Hong, Grace A. Lewis, Tongshuang Wu, Christian Kästner
Machine learning models make mistakes, yet sometimes it is difficult to identify the systematic problems behind the mistakes. Practitioners engage in various activities, including error analysis, testing, auditing, and red-teaming, to form hypotheses of what can go (or has gone) wrong with their models. To validate these hypotheses, practitioners employ data slicing to identify relevant examples. However, traditional data slicing is limited by available features and programmatic slicing functions. In this work, we propose SemSlicer, a framework that supports semantic data slicing, which identifies a semantically coherent slice, without the need for existing features. SemSlicer uses Large Language Models to annotate datasets and generate slices from any user-defined slicing criteria. We show that SemSlicer generates accurate slices with low cost, allows flexible trade-offs between different design dimensions, reliably identifies under-performing data slices, and helps practitioners identify useful data slices that reflect systematic problems.
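A minimal sketch of LLM-driven slicing from a free-form criterion: each example is judged against the user-defined criterion and kept if it matches. The prompt format and `ask_llm` helper are assumptions, not SemSlicer's actual interface.

```python
def ask_llm(prompt: str) -> str:
    """Placeholder LLM call expected to answer 'yes' or 'no'."""
    raise NotImplementedError

def semantic_slice(examples: list[dict], criterion: str) -> list[dict]:
    """Keep the examples an LLM judges to match a user-defined slicing criterion,
    e.g. 'the input contains sarcasm' or 'the question requires multi-hop reasoning'."""
    slice_members = []
    for ex in examples:
        answer = ask_llm(
            f"Slicing criterion: {criterion}\n"
            f"Example input: {ex['input']}\n"
            "Does this example satisfy the criterion? Answer yes or no."
        )
        if answer.strip().lower().startswith("yes"):
            slice_members.append(ex)
    return slice_members

# The per-slice error rate can then be compared with the overall error rate to
# check whether the hypothesized slice is actually under-performing.
```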
{"title":"What Is Wrong with My Model? Identifying Systematic Problems with Semantic Data Slicing","authors":"Chenyang Yang, Yining Hong, Grace A. Lewis, Tongshuang Wu, Christian Kästner","doi":"arxiv-2409.09261","DOIUrl":"https://doi.org/arxiv-2409.09261","url":null,"abstract":"Machine learning models make mistakes, yet sometimes it is difficult to\u0000identify the systematic problems behind the mistakes. Practitioners engage in\u0000various activities, including error analysis, testing, auditing, and\u0000red-teaming, to form hypotheses of what can go (or has gone) wrong with their\u0000models. To validate these hypotheses, practitioners employ data slicing to\u0000identify relevant examples. However, traditional data slicing is limited by\u0000available features and programmatic slicing functions. In this work, we propose\u0000SemSlicer, a framework that supports semantic data slicing, which identifies a\u0000semantically coherent slice, without the need for existing features. SemSlicer\u0000uses Large Language Models to annotate datasets and generate slices from any\u0000user-defined slicing criteria. We show that SemSlicer generates accurate slices\u0000with low cost, allows flexible trade-offs between different design dimensions,\u0000reliably identifies under-performing data slices, and helps practitioners\u0000identify useful data slices that reflect systematic problems.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261399","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The exponential growth of open-source package ecosystems, particularly NPM and PyPI, has led to an alarming increase in software supply chain poisoning attacks. Existing static analysis methods struggle with high false positive rates and are easily thwarted by obfuscation and dynamic code execution techniques. While dynamic analysis approaches offer improvements, they often capture behaviors unrelated to the package under analysis and employ simplistic testing strategies that fail to trigger sophisticated malicious behaviors. To address these challenges, we present OSCAR, a robust dynamic code poisoning detection pipeline for the NPM and PyPI ecosystems. OSCAR fully executes packages in a sandbox environment, employs fuzz testing on exported functions and classes, and implements aspect-based behavior monitoring with tailored API hook points. We evaluate OSCAR against six existing tools using a comprehensive benchmark dataset of real-world malicious and benign packages. OSCAR achieves an F1 score of 0.95 on NPM and 0.91 on PyPI, confirming that OSCAR is as effective as the current state-of-the-art technologies. Furthermore, for benign packages exhibiting characteristics typical of malicious packages, OSCAR reduces the false positive rate by an average of 32.06% on NPM (from 34.63% to 2.57%) and 39.87% on PyPI (from 41.10% to 1.23%) compared to other tools, significantly reducing the workload of manual reviews in real-world deployments. In cooperation with Ant Group, a leading financial technology company, we have deployed OSCAR on its NPM and PyPI mirrors since January 2023, identifying 10,404 malicious NPM packages and 1,235 malicious PyPI packages over 18 months. This work not only bridges the gap between academic research and industrial application in code poisoning detection but also provides a robust and practical solution that has been thoroughly tested in a real-world industrial setting.
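As a rough illustration of the monitoring-plus-fuzzing idea for the PyPI side, the sketch below hooks a couple of sensitive APIs, imports a package, and calls its exported functions with simple fuzzed inputs. The hook points, input samples, and overall structure are assumptions, and any real run would need to happen inside an isolated sandbox as the paper describes.

```python
import importlib
import inspect
import random
import socket
import subprocess

events: list[str] = []

def hook(module, attr):
    """Record calls to a sensitive API before delegating to the original."""
    original = getattr(module, attr)
    def wrapper(*args, **kwargs):
        events.append(f"{module.__name__}.{attr} called with {args!r}")
        return original(*args, **kwargs)
    setattr(module, attr, wrapper)

def fuzz_package(package_name: str, rounds: int = 20) -> list[str]:
    """Import the package under monitoring and call its exported functions
    with simple fuzzed inputs. Run this inside an isolated sandbox only."""
    hook(subprocess, "run")             # process spawning
    hook(socket, "create_connection")   # outbound network connections
    pkg = importlib.import_module(package_name)  # captures import-time behavior
    exported = [f for _, f in inspect.getmembers(pkg, inspect.isfunction)]
    samples = ["", "A" * 1024, "http://example.com", "../../etc/passwd", 0, -1, None]
    for _ in range(rounds):
        for func in exported:
            try:
                arity = len(inspect.signature(func).parameters)
                func(*random.choices(samples, k=arity))
            except Exception:
                pass  # crashes are expected; only the recorded behavior matters
    return events
```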
{"title":"Towards Robust Detection of Open Source Software Supply Chain Poisoning Attacks in Industry Environments","authors":"Xinyi Zheng, Chen Wei, Shenao Wang, Yanjie Zhao, Peiming Gao, Yuanchao Zhang, Kailong Wang, Haoyu Wang","doi":"arxiv-2409.09356","DOIUrl":"https://doi.org/arxiv-2409.09356","url":null,"abstract":"The exponential growth of open-source package ecosystems, particularly NPM\u0000and PyPI, has led to an alarming increase in software supply chain poisoning\u0000attacks. Existing static analysis methods struggle with high false positive\u0000rates and are easily thwarted by obfuscation and dynamic code execution\u0000techniques. While dynamic analysis approaches offer improvements, they often\u0000suffer from capturing non-package behaviors and employing simplistic testing\u0000strategies that fail to trigger sophisticated malicious behaviors. To address\u0000these challenges, we present OSCAR, a robust dynamic code poisoning detection\u0000pipeline for NPM and PyPI ecosystems. OSCAR fully executes packages in a\u0000sandbox environment, employs fuzz testing on exported functions and classes,\u0000and implements aspect-based behavior monitoring with tailored API hook points.\u0000We evaluate OSCAR against six existing tools using a comprehensive benchmark\u0000dataset of real-world malicious and benign packages. OSCAR achieves an F1 score\u0000of 0.95 in NPM and 0.91 in PyPI, confirming that OSCAR is as effective as the\u0000current state-of-the-art technologies. Furthermore, for benign packages\u0000exhibiting characteristics typical of malicious packages, OSCAR reduces the\u0000false positive rate by an average of 32.06% in NPM (from 34.63% to 2.57%) and\u000039.87% in PyPI (from 41.10% to 1.23%), compared to other tools, significantly\u0000reducing the workload of manual reviews in real-world deployments. In\u0000cooperation with Ant Group, a leading financial technology company, we have\u0000deployed OSCAR on its NPM and PyPI mirrors since January 2023, identifying\u000010,404 malicious NPM packages and 1,235 malicious PyPI packages over 18 months.\u0000This work not only bridges the gap between academic research and industrial\u0000application in code poisoning detection but also provides a robust and\u0000practical solution that has been thoroughly tested in a real-world industrial\u0000setting.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261404","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dong Huang, Jie M. Zhang, Mingzhe Du, Mark Harman, Heming Cui
Large language models (LLMs) have been widely applied to assist test generation, with the source code under test provided as context. This paper aims to answer the question: if the source code under test is incorrect, will LLMs be misguided when generating tests? The effectiveness of test cases is measured by their accuracy, coverage, and bug detection effectiveness. Our evaluation results with five open-source and six closed-source LLMs on four datasets demonstrate that incorrect code can significantly mislead LLMs away from generating correct, high-coverage, and bug-revealing tests. For instance, on the HumanEval dataset, LLMs achieve 80.45% test accuracy when provided with task descriptions and correct code, but only 57.12% when given task descriptions and incorrect code. For the APPS dataset, prompts with correct code yield tests that detect 39.85% of the bugs, while prompts with incorrect code detect only 19.61%. These findings have important implications for the deployment of LLM-based testing: using it on mature code may help protect against future regressions, but on early-stage, immature code it may simply bake in errors. Our findings also underscore the need for further research to improve LLMs' resilience against incorrect code when generating reliable and bug-revealing tests.
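A minimal sketch of how the experimental conditions could be constructed: the same task description is paired with correct code, incorrect code, or no code before asking an LLM for tests. The prompt wording and helper are assumptions, not the paper's exact setup.

```python
def ask_llm(prompt: str) -> str:
    """Placeholder: LLM generates unit tests for the given prompt."""
    raise NotImplementedError

def build_prompt(task_description: str, source_code: str | None) -> str:
    """Build a test-generation prompt under one experimental condition:
    task description only, task + correct code, or task + incorrect code."""
    prompt = f"Task description:\n{task_description}\n"
    if source_code is not None:
        prompt += f"Source code under test:\n{source_code}\n"
    prompt += "Write unit tests (e.g., pytest) for this task."
    return prompt

# Hypothetical comparison of conditions for one benchmark problem:
# tests_correct   = ask_llm(build_prompt(task, correct_impl))
# tests_incorrect = ask_llm(build_prompt(task, buggy_impl))
# The generated tests are then scored for accuracy, coverage, and bug detection.
```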
{"title":"Rethinking the Influence of Source Code on Test Case Generation","authors":"Dong Huang, Jie M. Zhang, Mingzhe Du, Mark Harman, Heming Cui","doi":"arxiv-2409.09464","DOIUrl":"https://doi.org/arxiv-2409.09464","url":null,"abstract":"Large language models (LLMs) have been widely applied to assist test\u0000generation with the source code under test provided as the context. This paper\u0000aims to answer the question: If the source code under test is incorrect, will\u0000LLMs be misguided when generating tests? The effectiveness of test cases is\u0000measured by their accuracy, coverage, and bug detection effectiveness. Our\u0000evaluation results with five open- and six closed-source LLMs on four datasets\u0000demonstrate that incorrect code can significantly mislead LLMs in generating\u0000correct, high-coverage, and bug-revealing tests. For instance, in the HumanEval\u0000dataset, LLMs achieve 80.45% test accuracy when provided with task descriptions\u0000and correct code, but only 57.12% when given task descriptions and incorrect\u0000code. For the APPS dataset, prompts with correct code yield tests that detect\u000039.85% of the bugs, while prompts with incorrect code detect only 19.61%. These\u0000findings have important implications for the deployment of LLM-based testing:\u0000using it on mature code may help protect against future regression, but on\u0000early-stage immature code, it may simply bake in errors. Our findings also\u0000underscore the need for further research to improve LLMs resilience against\u0000incorrect code in generating reliable and bug-revealing tests.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261396","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}