There is strong motivation to translate C code into Rust code due to the continuing threat of memory safety vulnerabilities in existing C programs and the significant attention paid to Rust as an alternative to the C language. While large language models (LLMs) show promise for automating this translation by generating more natural and safer code than rule-based methods, previous studies have shown that LLM-generated Rust code often fails to compile, even for relatively small C programs, due to significant differences between the two languages and context window limitations. We propose an LLM-based translation scheme that improves the success rate of translating large-scale C code into compilable Rust code. Our approach involves three key techniques: (1) pre-processing the C code to better align its structure and expressions with Rust, (2) segmenting the code into optimally sized translation units to avoid exceeding the LLM's context window limits, and (3) iteratively compiling and repairing errors while maintaining consistency between translation units using context-supplementing prompts. Compilation success is an essential first step toward functional equivalence, as only compilable code can be further tested. In experiments with 20 benchmark C programs, including programs exceeding 4,000 lines of code, we successfully translated all programs into compilable Rust code without losing any corresponding parts of the original code.
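As a rough illustration of the kind of pipeline the abstract describes (per-unit translation followed by an iterative compile-and-repair loop with context-supplementing prompts), here is a minimal Python sketch. The `llm_translate` and `llm_repair` helpers are hypothetical placeholders, the `rustc` invocation is only one possible way to check compilability, and none of this is the authors' actual implementation.

```python
import subprocess
import tempfile
from pathlib import Path

MAX_REPAIR_ROUNDS = 5

def llm_translate(c_unit: str, context: str) -> str:
    """Placeholder: ask an LLM to translate one pre-processed C unit to Rust,
    supplying already-translated units as context."""
    raise NotImplementedError("wire up your LLM client here")

def llm_repair(rust_code: str, compiler_errors: str, context: str) -> str:
    """Placeholder: ask an LLM to fix the given compiler errors while keeping
    the translation consistent with the supplied context."""
    raise NotImplementedError("wire up your LLM client here")

def compile_rust(rust_code: str) -> str:
    """Compile with rustc and return the error output ('' means success)."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "unit.rs"
        src.write_text(rust_code)
        result = subprocess.run(
            ["rustc", "--edition", "2021", "--crate-type", "lib", str(src)],
            capture_output=True, text=True, cwd=tmp,
        )
        return result.stderr if result.returncode != 0 else ""

def translate_units(c_units: list[str]) -> list[str]:
    """Translate each C unit, then iteratively repair it until it compiles."""
    translated: list[str] = []
    for unit in c_units:
        # Context-supplementing prompt: include recently translated units so
        # names and types stay consistent across translation units.
        context = "\n".join(translated[-3:])
        rust = llm_translate(unit, context)
        for _ in range(MAX_REPAIR_ROUNDS):
            errors = compile_rust(rust)
            if not errors:
                break
            rust = llm_repair(rust, errors, context)
        translated.append(rust)
    return translated
```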
{"title":"Context-aware Code Segmentation for C-to-Rust Translation using Large Language Models","authors":"Momoko Shiraishi, Takahiro Shinagawa","doi":"arxiv-2409.10506","DOIUrl":"https://doi.org/arxiv-2409.10506","url":null,"abstract":"There is strong motivation to translate C code into Rust code due to the\u0000continuing threat of memory safety vulnerabilities in existing C programs and\u0000the significant attention paid to Rust as an alternative to the C language.\u0000While large language models (LLMs) show promise for automating this translation\u0000by generating more natural and safer code than rule-based methods, previous\u0000studies have shown that LLM-generated Rust code often fails to compile, even\u0000for relatively small C programs, due to significant differences between the two\u0000languages and context window limitations. We propose an LLM-based translation\u0000scheme that improves the success rate of translating large-scale C code into\u0000compilable Rust code. Our approach involves three key techniques: (1)\u0000pre-processing the C code to better align its structure and expressions with\u0000Rust, (2) segmenting the code into optimally sized translation units to avoid\u0000exceeding the LLM's context window limits, and (3) iteratively compiling and\u0000repairing errors while maintaining consistency between translation units using\u0000context-supplementing prompts. Compilation success is an essential first step\u0000in achieving functional equivalence, as only compilable code can be further\u0000tested. In experiments with 20 benchmark C programs, including those exceeding\u00004 kilo lines of code, we successfully translated all programs into compilable\u0000Rust code without losing corresponding parts of the original code.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261338","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mobina Shahbandeh, Parsa Alian, Noor Nashid, Ali Mesbah
End-to-end web testing is challenging due to the need to explore diverse web application functionalities. Current state-of-the-art methods, such as WebCanvas, are not designed for broad functionality exploration; they rely on specific, detailed task descriptions, limiting their adaptability in dynamic web environments. We introduce NaviQAte, which frames web application exploration as a question-and-answer task, generating action sequences for functionalities without requiring detailed parameters. Our three-phase approach utilizes advanced large language models like GPT-4o for complex decision-making and cost-effective models, such as GPT-4o mini, for simpler tasks. NaviQAte focuses on functionality-guided web application navigation, integrating multi-modal inputs such as text and images to enhance contextual understanding. Evaluations on the Mind2Web-Live and Mind2Web-Live-Abstracted datasets show that NaviQAte achieves a 44.23% success rate in user task navigation and a 38.46% success rate in functionality navigation, representing a 15% and 33% improvement over WebCanvas. These results underscore the effectiveness of our approach in advancing automated web application testing.
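A minimal sketch of the general idea of framing navigation as question answering and routing between a stronger and a cheaper model by step difficulty. The helper names, data structure, and routing rule below are assumptions for illustration, not NaviQAte's actual design.

```python
from dataclasses import dataclass

@dataclass
class PageObservation:
    url: str
    visible_text: str            # text extracted from the rendered page
    screenshot_path: str | None  # optional image input for multi-modal models

def call_model(model: str, question: str, observation: PageObservation) -> str:
    """Placeholder for a multi-modal LLM call (text plus optional screenshot)."""
    raise NotImplementedError("wire up your LLM client here")

def next_action(functionality: str, observation: PageObservation) -> str:
    """Frame the navigation step as a question and pick a model by difficulty."""
    question = (
        f"To exercise the functionality '{functionality}', what single action "
        f"(click/type/select) should be taken next on this page?"
    )
    # Hypothetical routing rule: short, text-only pages go to the cheaper model.
    simple_step = observation.screenshot_path is None and len(observation.visible_text) < 2000
    model = "gpt-4o-mini" if simple_step else "gpt-4o"
    return call_model(model, question, observation)
```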
{"title":"NaviQAte: Functionality-Guided Web Application Navigation","authors":"Mobina Shahbandeh, Parsa Alian, Noor Nashid, Ali Mesbah","doi":"arxiv-2409.10741","DOIUrl":"https://doi.org/arxiv-2409.10741","url":null,"abstract":"End-to-end web testing is challenging due to the need to explore diverse web\u0000application functionalities. Current state-of-the-art methods, such as\u0000WebCanvas, are not designed for broad functionality exploration; they rely on\u0000specific, detailed task descriptions, limiting their adaptability in dynamic\u0000web environments. We introduce NaviQAte, which frames web application\u0000exploration as a question-and-answer task, generating action sequences for\u0000functionalities without requiring detailed parameters. Our three-phase approach\u0000utilizes advanced large language models like GPT-4o for complex decision-making\u0000and cost-effective models, such as GPT-4o mini, for simpler tasks. NaviQAte\u0000focuses on functionality-guided web application navigation, integrating\u0000multi-modal inputs such as text and images to enhance contextual understanding.\u0000Evaluations on the Mind2Web-Live and Mind2Web-Live-Abstracted datasets show\u0000that NaviQAte achieves a 44.23% success rate in user task navigation and a\u000038.46% success rate in functionality navigation, representing a 15% and 33%\u0000improvement over WebCanvas. These results underscore the effectiveness of our\u0000approach in advancing automated web application testing.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261331","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ana Nunez, Nafis Tanveer Islam, Sumit Kumar Jha, Peyman Najafirad
Recent advancements in automatic code generation using large language models (LLMs) have brought us closer to fully automated secure software development. However, existing approaches often rely on a single agent for code generation, which struggles to produce secure, vulnerability-free code. Traditional program synthesis with LLMs has primarily focused on functional correctness, often neglecting critical security implications that arise at runtime. To address these challenges, we propose AutoSafeCoder, a multi-agent framework that leverages LLM-driven agents for code generation, vulnerability analysis, and security enhancement through continuous collaboration. The framework consists of three agents: a Coding Agent responsible for code generation, a Static Analyzer Agent that identifies vulnerabilities, and a Fuzzing Agent that performs dynamic testing with a mutation-based fuzzing approach to detect runtime errors. Our contribution is to ensure the safety of LLM-based code generation by integrating static and dynamic testing into an iterative generation process, which improves security. Experiments on the SecurityEval dataset demonstrate a 13% reduction in code vulnerabilities compared to baseline LLMs, with no compromise in functionality.
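A minimal sketch of the generate-analyze-fuzz-repair loop the abstract outlines. All three agent functions are placeholders, and the loop structure and iteration cap are assumptions rather than AutoSafeCoder's actual implementation.

```python
MAX_ITERATIONS = 4

def coding_agent(task: str, feedback: list[str]) -> str:
    """Placeholder: LLM generates (or revises) code for the task given feedback."""
    raise NotImplementedError

def static_analyzer_agent(code: str) -> list[str]:
    """Placeholder: report potential vulnerabilities (CWE-style findings)."""
    raise NotImplementedError

def fuzzing_agent(code: str) -> list[str]:
    """Placeholder: mutation-based fuzzing of the code's entry points; returns runtime errors."""
    raise NotImplementedError

def generate_secure_code(task: str) -> str:
    """Iterate code generation until neither static nor dynamic issues remain."""
    feedback: list[str] = []
    code = ""
    for _ in range(MAX_ITERATIONS):
        code = coding_agent(task, feedback)
        findings = static_analyzer_agent(code) + fuzzing_agent(code)
        if not findings:
            break  # no static or dynamic issues found
        feedback = findings  # feed the issues back into the next generation round
    return code
```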
{"title":"AutoSafeCoder: A Multi-Agent Framework for Securing LLM Code Generation through Static Analysis and Fuzz Testing","authors":"Ana Nunez, Nafis Tanveer Islam, Sumit Kumar Jha, Peyman Najafirad","doi":"arxiv-2409.10737","DOIUrl":"https://doi.org/arxiv-2409.10737","url":null,"abstract":"Recent advancements in automatic code generation using large language models\u0000(LLMs) have brought us closer to fully automated secure software development.\u0000However, existing approaches often rely on a single agent for code generation,\u0000which struggles to produce secure, vulnerability-free code. Traditional program\u0000synthesis with LLMs has primarily focused on functional correctness, often\u0000neglecting critical dynamic security implications that happen during runtime.\u0000To address these challenges, we propose AutoSafeCoder, a multi-agent framework\u0000that leverages LLM-driven agents for code generation, vulnerability analysis,\u0000and security enhancement through continuous collaboration. The framework\u0000consists of three agents: a Coding Agent responsible for code generation, a\u0000Static Analyzer Agent identifying vulnerabilities, and a Fuzzing Agent\u0000performing dynamic testing using a mutation-based fuzzing approach to detect\u0000runtime errors. Our contribution focuses on ensuring the safety of multi-agent\u0000code generation by integrating dynamic and static testing in an iterative\u0000process during code generation by LLM that improves security. Experiments using\u0000the SecurityEval dataset demonstrate a 13% reduction in code vulnerabilities\u0000compared to baseline LLMs, with no compromise in functionality.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261334","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
LLM agents enhanced by tree search algorithms have yielded notable performance in code generation. However, current search algorithms in this domain suffer from low search quality for several reasons: 1) ineffective design of the search space for the high-reasoning demands of code generation tasks, 2) inadequate integration of code feedback with the search algorithm, and 3) poor handling of negative feedback during the search, leading to reduced search efficiency and quality. To address these challenges, we propose to search over the reasoning process behind the code and to use detailed feedback from code execution to refine erroneous thoughts during the search. In this paper, we introduce RethinkMCTS, which employs the Monte Carlo Tree Search (MCTS) algorithm to conduct thought-level searches before generating code, thereby exploring a wider range of strategies. More importantly, we construct verbal feedback from fine-grained code execution feedback to refine erroneous thoughts during the search. This ensures that the search progresses along correct reasoning paths, thus improving the overall search quality of the tree by leveraging execution feedback. Through extensive experiments, we demonstrate that RethinkMCTS outperforms previous search-based and feedback-based code generation baselines. On the HumanEval dataset, it improves the pass@1 of GPT-3.5-turbo from 70.12 to 89.02 and of GPT-4o-mini from 87.20 to 94.51. It effectively conducts more thorough exploration through thought-level searches and enhances the search quality of the entire tree by incorporating a rethink operation.
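A minimal sketch of thought-level MCTS with a rethink step driven by execution feedback, assuming hypothetical `propose_thoughts`, `generate_and_execute`, and `rethink` LLM helpers. It follows the standard select/expand/evaluate/backpropagate pattern and is not the paper's actual algorithm.

```python
import math
import random

class ThoughtNode:
    """A node in the search tree holds one reasoning step ('thought')."""
    def __init__(self, thought: str, parent=None):
        self.thought = thought
        self.parent = parent
        self.children: list["ThoughtNode"] = []
        self.visits = 0
        self.value = 0.0

    def uct(self, c: float = 1.4) -> float:
        if self.visits == 0:
            return float("inf")
        return self.value / self.visits + c * math.sqrt(
            math.log(self.parent.visits) / self.visits
        )

def propose_thoughts(path: list[str]) -> list[str]:
    """Placeholder: LLM proposes candidate next reasoning steps."""
    raise NotImplementedError

def generate_and_execute(path: list[str]) -> tuple[float, str]:
    """Placeholder: generate code from the thought path, run it on tests,
    and return (reward in [0, 1], verbal execution feedback)."""
    raise NotImplementedError

def rethink(thought: str, feedback: str) -> str:
    """Placeholder: LLM rewrites an erroneous thought using execution feedback."""
    raise NotImplementedError

def search(root: ThoughtNode, iterations: int = 32) -> None:
    for _ in range(iterations):
        # Selection: descend by UCT until a leaf is reached.
        node, path = root, [root.thought]
        while node.children:
            node = max(node.children, key=ThoughtNode.uct)
            path.append(node.thought)
        # Expansion: add LLM-proposed thoughts as children of the leaf.
        for t in propose_thoughts(path):
            node.children.append(ThoughtNode(t, parent=node))
        child = random.choice(node.children)
        # Evaluation: turn the thought path into code and execute it.
        reward, feedback = generate_and_execute(path + [child.thought])
        if reward < 1.0:
            # Rethink: refine the erroneous thought instead of discarding the branch.
            child.thought = rethink(child.thought, feedback)
        # Backpropagation up to the root.
        while child is not None:
            child.visits += 1
            child.value += reward
            child = child.parent
```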
{"title":"RethinkMCTS: Refining Erroneous Thoughts in Monte Carlo Tree Search for Code Generation","authors":"Qingyao Li, Wei Xia, Kounianhua Du, Xinyi Dai, Ruiming Tang, Yasheng Wang, Yong Yu, Weinan Zhang","doi":"arxiv-2409.09584","DOIUrl":"https://doi.org/arxiv-2409.09584","url":null,"abstract":"LLM agents enhanced by tree search algorithms have yielded notable\u0000performances in code generation. However, current search algorithms in this\u0000domain suffer from low search quality due to several reasons: 1) Ineffective\u0000design of the search space for the high-reasoning demands of code generation\u0000tasks, 2) Inadequate integration of code feedback with the search algorithm,\u0000and 3) Poor handling of negative feedback during the search, leading to reduced\u0000search efficiency and quality. To address these challenges, we propose to\u0000search for the reasoning process of the code and use the detailed feedback of\u0000code execution to refine erroneous thoughts during the search. In this paper,\u0000we introduce RethinkMCTS, which employs the Monte Carlo Tree Search (MCTS)\u0000algorithm to conduct thought-level searches before generating code, thereby\u0000exploring a wider range of strategies. More importantly, we construct verbal\u0000feedback from fine-grained code execution feedback to refine erroneous thoughts\u0000during the search. This ensures that the search progresses along the correct\u0000reasoning paths, thus improving the overall search quality of the tree by\u0000leveraging execution feedback. Through extensive experiments, we demonstrate\u0000that RethinkMCTS outperforms previous search-based and feedback-based code\u0000generation baselines. On the HumanEval dataset, it improves the pass@1 of\u0000GPT-3.5-turbo from 70.12 to 89.02 and GPT-4o-mini from 87.20 to 94.51. It\u0000effectively conducts more thorough exploration through thought-level searches\u0000and enhances the search quality of the entire tree by incorporating rethink\u0000operation.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261395","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Smart contracts are susceptible to exploitation by attackers, especially when they contain real-world vulnerabilities. To mitigate this risk, developers often rely on third-party audit services to identify potential vulnerabilities before project deployment. Nevertheless, repairing the identified vulnerabilities is still complex and labor-intensive, particularly for developers lacking security expertise. Moreover, existing pattern-based repair tools mostly fail to address real-world vulnerabilities due to their lack of high-level semantic understanding. To fill this gap, we propose ContractTinker, a Large Language Model (LLM)-empowered tool for real-world vulnerability repair. The key insight is our adoption of the Chain-of-Thought approach to break down the entire generation task into sub-tasks. Additionally, to reduce hallucination, we integrate program static analysis to guide the LLM. We evaluate ContractTinker on 48 high-risk vulnerabilities. The experimental results show that among the patches generated by ContractTinker, 23 (48%) are valid patches that fix the vulnerabilities, while 10 (21%) require only minor modifications. A video of ContractTinker is available at https://youtu.be/HWFVi-YHcPE.
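A minimal sketch of the described pipeline: decompose the repair into chain-of-thought sub-tasks and ground each prompt in static-analysis facts to reduce hallucination. The sub-task wording and the `static_analysis` and `ask_llm` helpers are hypothetical.

```python
def static_analysis(contract_source: str, vulnerability_report: str) -> str:
    """Placeholder: extract the functions, state variables, and call paths
    relevant to the reported vulnerability (e.g., via a Slither-style analyzer)."""
    raise NotImplementedError

def ask_llm(prompt: str) -> str:
    """Placeholder LLM call."""
    raise NotImplementedError

def repair_contract(contract_source: str, vulnerability_report: str) -> str:
    """Chain-of-thought style pipeline: each sub-task is grounded in
    static-analysis facts to keep the LLM focused on the real code."""
    facts = static_analysis(contract_source, vulnerability_report)
    root_cause = ask_llm(
        f"Vulnerability report:\n{vulnerability_report}\n"
        f"Relevant program facts:\n{facts}\n"
        "Sub-task 1: explain the root cause of this vulnerability."
    )
    fix_plan = ask_llm(
        f"Root cause:\n{root_cause}\n"
        "Sub-task 2: propose a minimal fix plan that preserves the contract's intended behavior."
    )
    patch = ask_llm(
        f"Contract:\n{contract_source}\nFix plan:\n{fix_plan}\n"
        "Sub-task 3: output the patched Solidity code only."
    )
    return patch
```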
{"title":"ContractTinker: LLM-Empowered Vulnerability Repair for Real-World Smart Contracts","authors":"Che Wang, Jiashuo Zhang, Jianbo Gao, Libin Xia, Zhi Guan, Zhong Chen","doi":"arxiv-2409.09661","DOIUrl":"https://doi.org/arxiv-2409.09661","url":null,"abstract":"Smart contracts are susceptible to being exploited by attackers, especially\u0000when facing real-world vulnerabilities. To mitigate this risk, developers often\u0000rely on third-party audit services to identify potential vulnerabilities before\u0000project deployment. Nevertheless, repairing the identified vulnerabilities is\u0000still complex and labor-intensive, particularly for developers lacking security\u0000expertise. Moreover, existing pattern-based repair tools mostly fail to address\u0000real-world vulnerabilities due to their lack of high-level semantic\u0000understanding. To fill this gap, we propose ContractTinker, a Large Language\u0000Models (LLMs)-empowered tool for real-world vulnerability repair. The key\u0000insight is our adoption of the Chain-of-Thought approach to break down the\u0000entire generation task into sub-tasks. Additionally, to reduce hallucination,\u0000we integrate program static analysis to guide the LLM. We evaluate\u0000ContractTinker on 48 high-risk vulnerabilities. The experimental results show\u0000that among the patches generated by ContractTinker, 23 (48%) are valid patches\u0000that fix the vulnerabilities, while 10 (21%) require only minor modifications.\u0000A video of ContractTinker is available at https://youtu.be/HWFVi-YHcPE.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261393","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Accurate estimation of project costs and durations remains a pivotal challenge in software engineering, directly impacting budgeting and resource management. Traditional estimation techniques, although widely utilized, often fall short due to their complexity and the dynamic nature of software development projects. This study introduces an innovative approach using Large Language Models (LLMs) to enhance the accuracy and usability of project cost predictions. We explore the efficacy of LLMs against traditional methods and contemporary machine learning techniques, focusing on their potential to simplify the estimation process and provide higher accuracy. Our research is structured around critical inquiries: whether LLMs can outperform existing models and traditional estimation techniques, how easily they can be integrated into current practices, and why traditional methods still prevail in industry settings. By applying LLMs to a range of real-world datasets and comparing their performance to both state-of-the-art and conventional methods, this study aims to demonstrate that LLMs not only yield more accurate estimates but also offer a user-friendly alternative to complex predictive models, potentially transforming project management strategies within the software industry.
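Purely as an illustration of how such a comparison might be set up, the sketch below prompts an LLM for structured effort and duration estimates that could then be scored against actuals and traditional baselines. The prompt, field names, and example record are assumptions, not the study's protocol.

```python
import json

def ask_llm(prompt: str) -> str:
    """Placeholder LLM call returning a JSON string."""
    raise NotImplementedError

def estimate_effort(project: dict) -> dict:
    """Ask an LLM for effort/duration estimates from structured project features,
    so the output can be compared against actuals and traditional baselines
    (e.g., COCOMO-style models) using error metrics such as MAE or MMRE."""
    prompt = (
        "You are a software project estimator. Given the project description "
        "below, return JSON with fields 'effort_person_months' and 'duration_months'.\n"
        f"{json.dumps(project, indent=2)}"
    )
    return json.loads(ask_llm(prompt))

# Example (hypothetical) project record:
# estimate_effort({"team_size": 6, "kloc": 45, "domain": "fintech", "methodology": "agile"})
```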
{"title":"Leveraging Large Language Models for Predicting Cost and Duration in Software Engineering Projects","authors":"Justin Carpenter, Chia-Ying Wu, Nasir U. Eisty","doi":"arxiv-2409.09617","DOIUrl":"https://doi.org/arxiv-2409.09617","url":null,"abstract":"Accurate estimation of project costs and durations remains a pivotal\u0000challenge in software engineering, directly impacting budgeting and resource\u0000management. Traditional estimation techniques, although widely utilized, often\u0000fall short due to their complexity and the dynamic nature of software\u0000development projects. This study introduces an innovative approach using Large\u0000Language Models (LLMs) to enhance the accuracy and usability of project cost\u0000predictions. We explore the efficacy of LLMs against traditional methods and\u0000contemporary machine learning techniques, focusing on their potential to\u0000simplify the estimation process and provide higher accuracy. Our research is\u0000structured around critical inquiries into whether LLMs can outperform existing\u0000models, the ease of their integration into current practices, outperform\u0000traditional estimation, and why traditional methods still prevail in industry\u0000settings. By applying LLMs to a range of real-world datasets and comparing\u0000their performance to both state-of-the-art and conventional methods, this study\u0000aims to demonstrate that LLMs not only yield more accurate estimates but also\u0000offer a user-friendly alternative to complex predictive models, potentially\u0000transforming project management strategies within the software industry.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261394","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
C. B. Pronin, A. V. Volosova, A. V. Ostroukh, Yu. N. Strogov
This paper describes an approach to training and evaluating an adapter model for the popular language model "zephyr-7b-beta". The adapter was developed to improve the performance of the base model in tasks related to programming and understanding the Russian language. Given the high quality of the original model on English-language tasks, the goal of the research was to expand its linguistic and technical range. The proposed adapter was trained on a large and diverse dataset, including programming-related question-answer pairs as well as code-related texts in Russian. The applied training methodology improves the quality of the model's answers when understanding and generating Python code from Russian-language instructions. We evaluated the performance of the base model with the installed adapter using various metrics, comparing it to the base model as well as other state-of-the-art models in this field. The obtained results showed significant improvement, both in tasks related to writing Python code and in processing the Russian language, confirming the effectiveness of the proposed adapter.
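A minimal sketch of a QLoRA-style setup for this kind of adapter, using the Hugging Face `transformers` and `peft` libraries: the base model is loaded in 4-bit and small LoRA matrices are trained on top. The model id, target modules, and hyperparameters are illustrative assumptions and may differ from the paper's settings.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

BASE_MODEL = "HuggingFaceH4/zephyr-7b-beta"  # assumed Hugging Face id of the base model

# QLoRA: keep the base model frozen in 4-bit precision and train only LoRA adapters.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL, quantization_config=bnb_config, device_map="auto"
)

# Illustrative LoRA hyperparameters; the paper's actual settings may differ.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Fine-tuning on Russian programming Q&A pairs would then proceed with a standard
# training loop, e.g. transformers' Trainer or trl's SFTTrainer.
```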
{"title":"Overcoming linguistic barriers in code assistants: creating a QLoRA adapter to improve support for Russian-language code writing instructions","authors":"C. B. Pronin, A. V. Volosova, A. V. Ostroukh, Yu. N. Strogov","doi":"arxiv-2409.09353","DOIUrl":"https://doi.org/arxiv-2409.09353","url":null,"abstract":"In this paper, an approach to training and evaluating an adapter model for\u0000the popular language model \"zephyr-7b-beta\" is described. The adapter was\u0000developed to improve the performance of the base model in tasks related to\u0000programming and understanding the Russian language. Considering the high\u0000quality of the original model in tasks in the English language, the goal of the\u0000research was to expand its linguistic and technical spectrum. The proposed\u0000adapter was trained using a large and diverse dataset, including\u0000question-answer pairs related to programming, as well code-related texts in\u0000Russian language. The applied training methodology ensures an improvement in\u0000the model's quality of answers in understanding and generating Python code\u0000based on Russian instructions. We evaluated the performance of the base model\u0000with the installed adapter using various metrics, comparing it to the base\u0000model as well as other state-of-the-art models in this field. The obtained\u0000results showed significant improvement, both in tasks related to writing Python\u0000code and in processing the Russian language, confirming the effectiveness of\u0000the proposed adapter.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261398","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chenyang Yang, Yining Hong, Grace A. Lewis, Tongshuang Wu, Christian Kästner
Machine learning models make mistakes, yet sometimes it is difficult to identify the systematic problems behind the mistakes. Practitioners engage in various activities, including error analysis, testing, auditing, and red-teaming, to form hypotheses of what can go (or has gone) wrong with their models. To validate these hypotheses, practitioners employ data slicing to identify relevant examples. However, traditional data slicing is limited by available features and programmatic slicing functions. In this work, we propose SemSlicer, a framework that supports semantic data slicing, which identifies a semantically coherent slice, without the need for existing features. SemSlicer uses Large Language Models to annotate datasets and generate slices from any user-defined slicing criteria. We show that SemSlicer generates accurate slices with low cost, allows flexible trade-offs between different design dimensions, reliably identifies under-performing data slices, and helps practitioners identify useful data slices that reflect systematic problems.
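A minimal sketch of LLM-driven slicing from a free-form criterion: each example is judged against the user-defined criterion and kept if it matches. The prompt format and `ask_llm` helper are assumptions, not SemSlicer's actual interface.

```python
def ask_llm(prompt: str) -> str:
    """Placeholder LLM call expected to answer 'yes' or 'no'."""
    raise NotImplementedError

def semantic_slice(examples: list[dict], criterion: str) -> list[dict]:
    """Keep the examples an LLM judges to match a user-defined slicing criterion,
    e.g. 'the input contains sarcasm' or 'the question requires multi-hop reasoning'."""
    slice_members = []
    for ex in examples:
        answer = ask_llm(
            f"Slicing criterion: {criterion}\n"
            f"Example input: {ex['input']}\n"
            "Does this example satisfy the criterion? Answer yes or no."
        )
        if answer.strip().lower().startswith("yes"):
            slice_members.append(ex)
    return slice_members

# The per-slice error rate can then be compared with the overall error rate to
# check whether the hypothesized slice is actually under-performing.
```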
{"title":"What Is Wrong with My Model? Identifying Systematic Problems with Semantic Data Slicing","authors":"Chenyang Yang, Yining Hong, Grace A. Lewis, Tongshuang Wu, Christian Kästner","doi":"arxiv-2409.09261","DOIUrl":"https://doi.org/arxiv-2409.09261","url":null,"abstract":"Machine learning models make mistakes, yet sometimes it is difficult to\u0000identify the systematic problems behind the mistakes. Practitioners engage in\u0000various activities, including error analysis, testing, auditing, and\u0000red-teaming, to form hypotheses of what can go (or has gone) wrong with their\u0000models. To validate these hypotheses, practitioners employ data slicing to\u0000identify relevant examples. However, traditional data slicing is limited by\u0000available features and programmatic slicing functions. In this work, we propose\u0000SemSlicer, a framework that supports semantic data slicing, which identifies a\u0000semantically coherent slice, without the need for existing features. SemSlicer\u0000uses Large Language Models to annotate datasets and generate slices from any\u0000user-defined slicing criteria. We show that SemSlicer generates accurate slices\u0000with low cost, allows flexible trade-offs between different design dimensions,\u0000reliably identifies under-performing data slices, and helps practitioners\u0000identify useful data slices that reflect systematic problems.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261399","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The exponential growth of open-source package ecosystems, particularly NPM and PyPI, has led to an alarming increase in software supply chain poisoning attacks. Existing static analysis methods struggle with high false positive rates and are easily thwarted by obfuscation and dynamic code execution techniques. While dynamic analysis approaches offer improvements, they often capture behaviors unrelated to the package under analysis and employ simplistic testing strategies that fail to trigger sophisticated malicious behaviors. To address these challenges, we present OSCAR, a robust dynamic code poisoning detection pipeline for the NPM and PyPI ecosystems. OSCAR fully executes packages in a sandbox environment, employs fuzz testing on exported functions and classes, and implements aspect-based behavior monitoring with tailored API hook points. We evaluate OSCAR against six existing tools using a comprehensive benchmark dataset of real-world malicious and benign packages. OSCAR achieves an F1 score of 0.95 on NPM and 0.91 on PyPI, confirming that OSCAR is as effective as the current state-of-the-art technologies. Furthermore, for benign packages exhibiting characteristics typical of malicious packages, OSCAR reduces the false positive rate by an average of 32.06% on NPM (from 34.63% to 2.57%) and 39.87% on PyPI (from 41.10% to 1.23%) compared to other tools, significantly reducing the workload of manual reviews in real-world deployments. In cooperation with Ant Group, a leading financial technology company, we have deployed OSCAR on its NPM and PyPI mirrors since January 2023, identifying 10,404 malicious NPM packages and 1,235 malicious PyPI packages over 18 months. This work not only bridges the gap between academic research and industrial application in code poisoning detection but also provides a robust and practical solution that has been thoroughly tested in a real-world industrial setting.
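As a rough illustration of the monitoring-plus-fuzzing idea for the PyPI side, the sketch below hooks a couple of sensitive APIs, imports a package, and calls its exported functions with simple fuzzed inputs. The hook points, input samples, and overall structure are assumptions, and any real run would need to happen inside an isolated sandbox as the paper describes.

```python
import importlib
import inspect
import random
import socket
import subprocess

events: list[str] = []

def hook(module, attr):
    """Record calls to a sensitive API before delegating to the original."""
    original = getattr(module, attr)
    def wrapper(*args, **kwargs):
        events.append(f"{module.__name__}.{attr} called with {args!r}")
        return original(*args, **kwargs)
    setattr(module, attr, wrapper)

def fuzz_package(package_name: str, rounds: int = 20) -> list[str]:
    """Import the package under monitoring and call its exported functions
    with simple fuzzed inputs. Run this inside an isolated sandbox only."""
    hook(subprocess, "run")             # process spawning
    hook(socket, "create_connection")   # outbound network connections
    pkg = importlib.import_module(package_name)  # captures import-time behavior
    exported = [f for _, f in inspect.getmembers(pkg, inspect.isfunction)]
    samples = ["", "A" * 1024, "http://example.com", "../../etc/passwd", 0, -1, None]
    for _ in range(rounds):
        for func in exported:
            try:
                arity = len(inspect.signature(func).parameters)
                func(*random.choices(samples, k=arity))
            except Exception:
                pass  # crashes are expected; only the recorded behavior matters
    return events
```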
{"title":"Towards Robust Detection of Open Source Software Supply Chain Poisoning Attacks in Industry Environments","authors":"Xinyi Zheng, Chen Wei, Shenao Wang, Yanjie Zhao, Peiming Gao, Yuanchao Zhang, Kailong Wang, Haoyu Wang","doi":"arxiv-2409.09356","DOIUrl":"https://doi.org/arxiv-2409.09356","url":null,"abstract":"The exponential growth of open-source package ecosystems, particularly NPM\u0000and PyPI, has led to an alarming increase in software supply chain poisoning\u0000attacks. Existing static analysis methods struggle with high false positive\u0000rates and are easily thwarted by obfuscation and dynamic code execution\u0000techniques. While dynamic analysis approaches offer improvements, they often\u0000suffer from capturing non-package behaviors and employing simplistic testing\u0000strategies that fail to trigger sophisticated malicious behaviors. To address\u0000these challenges, we present OSCAR, a robust dynamic code poisoning detection\u0000pipeline for NPM and PyPI ecosystems. OSCAR fully executes packages in a\u0000sandbox environment, employs fuzz testing on exported functions and classes,\u0000and implements aspect-based behavior monitoring with tailored API hook points.\u0000We evaluate OSCAR against six existing tools using a comprehensive benchmark\u0000dataset of real-world malicious and benign packages. OSCAR achieves an F1 score\u0000of 0.95 in NPM and 0.91 in PyPI, confirming that OSCAR is as effective as the\u0000current state-of-the-art technologies. Furthermore, for benign packages\u0000exhibiting characteristics typical of malicious packages, OSCAR reduces the\u0000false positive rate by an average of 32.06% in NPM (from 34.63% to 2.57%) and\u000039.87% in PyPI (from 41.10% to 1.23%), compared to other tools, significantly\u0000reducing the workload of manual reviews in real-world deployments. In\u0000cooperation with Ant Group, a leading financial technology company, we have\u0000deployed OSCAR on its NPM and PyPI mirrors since January 2023, identifying\u000010,404 malicious NPM packages and 1,235 malicious PyPI packages over 18 months.\u0000This work not only bridges the gap between academic research and industrial\u0000application in code poisoning detection but also provides a robust and\u0000practical solution that has been thoroughly tested in a real-world industrial\u0000setting.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261404","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dong Huang, Jie M. Zhang, Mingzhe Du, Mark Harman, Heming Cui
Large language models (LLMs) have been widely applied to assist test generation, with the source code under test provided as context. This paper aims to answer the question: if the source code under test is incorrect, will LLMs be misguided when generating tests? The effectiveness of test cases is measured by their accuracy, coverage, and bug detection effectiveness. Our evaluation results with five open-source and six closed-source LLMs on four datasets demonstrate that incorrect code can significantly mislead LLMs away from generating correct, high-coverage, and bug-revealing tests. For instance, on the HumanEval dataset, LLMs achieve 80.45% test accuracy when provided with task descriptions and correct code, but only 57.12% when given task descriptions and incorrect code. For the APPS dataset, prompts with correct code yield tests that detect 39.85% of the bugs, while prompts with incorrect code detect only 19.61%. These findings have important implications for the deployment of LLM-based testing: using it on mature code may help protect against future regressions, but on early-stage, immature code it may simply bake in errors. Our findings also underscore the need for further research to improve LLMs' resilience against incorrect code when generating reliable and bug-revealing tests.
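A minimal sketch of how the experimental conditions could be constructed: the same task description is paired with correct code, incorrect code, or no code before asking an LLM for tests. The prompt wording and helper are assumptions, not the paper's exact setup.

```python
def ask_llm(prompt: str) -> str:
    """Placeholder: LLM generates unit tests for the given prompt."""
    raise NotImplementedError

def build_prompt(task_description: str, source_code: str | None) -> str:
    """Build a test-generation prompt under one experimental condition:
    task description only, task + correct code, or task + incorrect code."""
    prompt = f"Task description:\n{task_description}\n"
    if source_code is not None:
        prompt += f"Source code under test:\n{source_code}\n"
    prompt += "Write unit tests (e.g., pytest) for this task."
    return prompt

# Hypothetical comparison of conditions for one benchmark problem:
# tests_correct   = ask_llm(build_prompt(task, correct_impl))
# tests_incorrect = ask_llm(build_prompt(task, buggy_impl))
# The generated tests are then scored for accuracy, coverage, and bug detection.
```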
{"title":"Rethinking the Influence of Source Code on Test Case Generation","authors":"Dong Huang, Jie M. Zhang, Mingzhe Du, Mark Harman, Heming Cui","doi":"arxiv-2409.09464","DOIUrl":"https://doi.org/arxiv-2409.09464","url":null,"abstract":"Large language models (LLMs) have been widely applied to assist test\u0000generation with the source code under test provided as the context. This paper\u0000aims to answer the question: If the source code under test is incorrect, will\u0000LLMs be misguided when generating tests? The effectiveness of test cases is\u0000measured by their accuracy, coverage, and bug detection effectiveness. Our\u0000evaluation results with five open- and six closed-source LLMs on four datasets\u0000demonstrate that incorrect code can significantly mislead LLMs in generating\u0000correct, high-coverage, and bug-revealing tests. For instance, in the HumanEval\u0000dataset, LLMs achieve 80.45% test accuracy when provided with task descriptions\u0000and correct code, but only 57.12% when given task descriptions and incorrect\u0000code. For the APPS dataset, prompts with correct code yield tests that detect\u000039.85% of the bugs, while prompts with incorrect code detect only 19.61%. These\u0000findings have important implications for the deployment of LLM-based testing:\u0000using it on mature code may help protect against future regression, but on\u0000early-stage immature code, it may simply bake in errors. Our findings also\u0000underscore the need for further research to improve LLMs resilience against\u0000incorrect code in generating reliable and bug-revealing tests.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261396","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}