{"title":"Patch correctness assessment in automated program repair based on the impact of patches on production and test code","authors":"Ali Ghanbari, Andrian Marcus","doi":"10.1145/3533767.3534368","DOIUrl":null,"url":null,"abstract":"Test-based generate-and-validate automated program repair (APR) systems often generate many patches that pass the test suite without fixing the bug. The generated patches must be manually inspected by the developers, so previous research proposed various techniques for automatic correctness assessment of APR-generated patches. Among them, dynamic patch correctness assessment techniques rely on the assumption that, when running the originally passing test cases, the correct patches will not alter the program behavior in a significant way, e.g., removing the code implementing correct functionality of the program. In this paper, we propose and evaluate a novel technique, named Shibboleth, for automatic correctness assessment of the patches generated by test-based generate-and-validate APR systems. Unlike existing works, the impact of the patches is captured along three complementary facets, allowing more effective patch correctness assessment. Specifically, we measure the impact of patches on both production code (via syntactic and semantic similarity) and test code (via code coverage of passing tests) to separate the patches that result in similar programs and that do not delete desired program elements. Shibboleth assesses the correctness of patches via both ranking and classification. We evaluated Shibboleth on 1,871 patches, generated by 29 Java-based APR systems for Defects4J programs. The technique outperforms state-of-the-art ranking and classification techniques. Specifically, in our ranking data set, in 43% (66%) of the cases, Shibboleth ranks the correct patch in top-1 (top-2) positions, and in classification mode applied on our classification data set, it achieves an accuracy and F1-score of 0.887 and 0.852, respectively.","PeriodicalId":412271,"journal":{"name":"Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis","volume":"27 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3533767.3534368","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 5
Abstract
Test-based generate-and-validate automated program repair (APR) systems often generate many patches that pass the test suite without fixing the bug. The generated patches must be manually inspected by the developers, so previous research has proposed various techniques for automatic correctness assessment of APR-generated patches. Among them, dynamic patch correctness assessment techniques rely on the assumption that, when running the originally passing test cases, a correct patch will not alter the program behavior in a significant way, e.g., by removing code that implements correct functionality of the program. In this paper, we propose and evaluate a novel technique, named Shibboleth, for automatic correctness assessment of the patches generated by test-based generate-and-validate APR systems. Unlike existing works, Shibboleth captures the impact of the patches along three complementary facets, allowing more effective patch correctness assessment. Specifically, we measure the impact of patches on both production code (via syntactic and semantic similarity) and test code (via code coverage of passing tests) to isolate the patches that result in programs similar to the original and that do not delete desired program elements. Shibboleth assesses the correctness of patches via both ranking and classification. We evaluated Shibboleth on 1,871 patches, generated by 29 Java-based APR systems for Defects4J programs. The technique outperforms state-of-the-art ranking and classification techniques. Specifically, on our ranking data set, Shibboleth ranks the correct patch in the top-1 (top-2) position in 43% (66%) of the cases, and in classification mode, applied to our classification data set, it achieves an accuracy and F1-score of 0.887 and 0.852, respectively.
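To make the ranking idea concrete, the sketch below shows one way the three facets named in the abstract (syntactic similarity, semantic similarity, and preserved coverage of originally passing tests) could be combined into a per-patch score. This is a minimal illustration under assumed conventions, not Shibboleth's actual scoring model: the record fields, the normalization to [0, 1], and the unweighted average are all hypothetical.

```java
import java.util.Comparator;
import java.util.List;

// Illustrative sketch: rank candidate APR patches so that those whose patched
// program stays closest to the original (and preserves test coverage) come first.
public class PatchRanker {

    // Each facet is assumed to be normalized to [0, 1]; higher means the
    // patched program is closer to the original along that facet.
    public record PatchFacets(String patchId,
                              double syntacticSimilarity,
                              double semanticSimilarity,
                              double coveragePreservation) {}

    // Hypothetical combined score: a simple unweighted average of the facets.
    static double score(PatchFacets p) {
        return (p.syntacticSimilarity()
                + p.semanticSimilarity()
                + p.coveragePreservation()) / 3.0;
    }

    // Sort patches in descending order of the combined score, mirroring the
    // ranking mode described in the abstract (likely-correct patches first).
    public static List<PatchFacets> rank(List<PatchFacets> patches) {
        return patches.stream()
                .sorted(Comparator.comparingDouble(PatchRanker::score).reversed())
                .toList();
    }
}
```

A classification mode, as described in the abstract, could threshold a score of this kind (or feed the three facets to a learned classifier) to label each patch as likely correct or likely overfitting; the paper itself should be consulted for the actual features and decision procedure.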