On the acceptance by code reviewers of candidate security patches suggested by Automated Program Repair tools

Empirical Software Engineering (IF 3.5; Zone 2, Computer Science; JCR Q1, Computer Science, Software Engineering). Pub Date: 2024-08-03. DOI: 10.1007/s10664-024-10506-z

Aurora Papotti, Ranindya Paramitha, Fabio Massacci


Abstract

Objective

We investigated whether (possibly wrong) security patches suggested by Automated Program Repair (APR) tools for real-world projects are recognized by human reviewers. We also investigated whether knowing that a patch was produced by an allegedly specialized tool changes the decision of human reviewers.

Method

We performed an experiment with \(n = 72\) Master's students in Computer Science. In the first phase, using a balanced design, we presented human reviewers with a combination of patches proposed by APR tools for different vulnerabilities and asked them to adopt or reject each proposed patch. In the second phase, we told participants that some of the proposed patches were generated by security-specialized tools (even if the tool was actually a ‘normal’ APR tool) and measured whether the reviewers changed their decision to adopt or reject a patch.

Results

It is easier to identify wrong patches than correct patches, and correct patches are not confused with partially correct patches. Patches from APR security tools are also adopted more often than patches suggested by generic APR tools, but there is not enough evidence to determine whether ‘bogus’ security claims are distinguishable from ‘true security’ claims. Finally, the number of switches to the patches suggested by security tools is significantly higher after the security information is revealed, irrespective of correctness.

Limitations

The experiment was conducted in an academic setting, and focused on a limited sample of popular APR tools and popular vulnerability types.

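Source journal

Empirical Software Engineering (Engineering & Technology, Computer Science: Software Engineering)

CiteScore: 8.50
Self-citation rate: 12.20%
Publication volume: 169
Review time: >12 weeks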

About the journal: Empirical Software Engineering provides a forum for applied software engineering research with a strong empirical component, and a venue for publishing empirical results relevant to both researchers and practitioners. Empirical studies presented here usually involve the collection and analysis of data and experience that can be used to characterize, evaluate and reveal relationships between software development deliverables, practices, and technologies. Over time, it is expected that such empirical results will form a body of knowledge leading to widely accepted and well-formed theories. The journal also offers industrial experience reports detailing the application of software technologies - processes, methods, or tools - and their effectiveness in industrial settings. Empirical Software Engineering promotes the publication of industry-relevant research, to address the significant gap between research and practice.

Latest articles from this journal

The effect of data complexity on classifier performance
Reinforcement learning for online testing of autonomous driving systems: a replication and extension study
An empirical study on developers’ shared conversations with ChatGPT in GitHub pull requests and issues
Quality issues in machine learning software systems
An empirical study of token-based micro commits