
ACM Transactions on Software Engineering and Methodology: Latest Publications

Testing Multi-Subroutine Quantum Programs: From Unit Testing to Integration Testing
IF 4.4 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING | Pub Date: 2024-04-05 | DOI: 10.1145/3656339
Peixun Long, Jianjun Zhao

Quantum computing has emerged as a promising field with the potential to revolutionize various domains by harnessing the principles of quantum mechanics. As quantum hardware and algorithms continue to advance, developing high-quality quantum software has become crucial. However, testing quantum programs poses unique challenges due to the distinctive characteristics of quantum systems and the complexity of multi-subroutine programs. This paper addresses the specific testing requirements of multi-subroutine quantum programs. We begin by investigating critical properties by surveying existing quantum libraries and providing insights into the challenges of testing these programs. Building upon this understanding, we focus on testing criteria and techniques based on the whole testing process perspective, spanning from unit testing to integration testing. We delve into various aspects, including IO analysis, quantum relation checking, structural testing, behavior testing, integration of subroutine pairs, and test case generation. We also introduce novel testing principles and criteria to guide the testing process. We conduct comprehensive testing on typical quantum subroutines, including diverse mutants and randomized inputs, to evaluate our proposed approach. The analysis of failures provides valuable insights into the effectiveness of our testing methodology. Additionally, we present case studies on representative multi-subroutine quantum programs, demonstrating the practical application and effectiveness of our proposed testing principles and criteria.
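The paper's own criteria and tooling are not reproduced here; as a rough illustration of the kind of input/output relation checking described above, the sketch below unit-tests a single-qubit subroutine with NumPy. The gate, helper names, and tolerance are assumptions for illustration, not the authors' framework.

```python
import numpy as np

# Illustrative only: a unit test for a single-qubit "subroutine" (here, a
# Hadamard gate), checking the kind of input/output relation that testing
# criteria for quantum programs reason about. Names are hypothetical.
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)

def run_subroutine(state: np.ndarray) -> np.ndarray:
    """Apply the subroutine under test to an input state vector."""
    return H @ state

def states_equivalent(a: np.ndarray, b: np.ndarray, tol: float = 1e-9) -> bool:
    """Compare states up to global phase via the fidelity |<a|b>|^2."""
    fidelity = abs(np.vdot(a, b)) ** 2
    return abs(fidelity - 1.0) < tol

# |0> should map to |+> = (|0> + |1>)/sqrt(2)
ket0 = np.array([1.0, 0.0])
expected = np.array([1.0, 1.0]) / np.sqrt(2)
assert states_equivalent(run_subroutine(ket0), expected)
print("unit test passed")
```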

Citations: 0
On the Impact of Lower Recall and Precision in Defect Prediction for Guiding Search-Based Software Testing
IF 4.4 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING | Pub Date: 2024-04-04 | DOI: 10.1145/3655022
Anjana Perera, Burak Turhan, Aldeida Aleti, Marcel Böhme

Defect predictors, static bug detectors and humans inspecting the code can propose program locations that are likely to be buggy before the bugs are discovered through testing. Automated test generators such as search-based software testing (SBST) techniques can use this information to direct their search for test cases to likely-buggy code, thus speeding up the process of detecting existing bugs in those locations. Often the predictions given by these tools or humans are imprecise, which can misguide the SBST technique and degrade its performance. In this paper, we study the impact of imprecision in defect prediction on the bug detection effectiveness of SBST.

Our study finds that the recall of the defect predictor, i.e., the proportion of correctly identified buggy code, has a significant impact on the bug detection effectiveness of SBST, with a large effect size. More precisely, the SBST technique detects 7.5 fewer bugs on average (out of 420 bugs) for every 5% decrement in recall. On the other hand, the effect of precision, a measure of false alarms, is not of meaningful practical significance, as indicated by a very small effect size.

In the context of combining defect prediction and SBST, our recommendation is to increase the recall of defect predictors as a primary objective and precision as a secondary objective. In our experiments, we find that 75% precision is as good as 100% precision. To account for the imprecision of defect predictors, in particular low recall values, SBST techniques should be designed to search for test cases that also cover the predicted non-buggy parts of the program, while prioritising the parts that have been predicted as buggy.
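As a minimal sketch of the recommendation in the last paragraph, the snippet below shows one way a search-based generator could weight coverage of predicted-buggy code more heavily while still rewarding coverage of predicted-clean code; the function name, weights, and line identifiers are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: bias a search-based test generator toward code predicted
# as buggy without abandoning the rest. Weights and identifiers are
# illustrative assumptions only.

def coverage_priority(covered_lines, predicted_buggy, buggy_weight=3.0, clean_weight=1.0):
    """Score a test case by weighted coverage of predicted-buggy vs. other lines."""
    score = 0.0
    for line in covered_lines:
        score += buggy_weight if line in predicted_buggy else clean_weight
    return score

# Two candidate test cases covering different line sets.
predicted_buggy = {"Foo.java:42", "Foo.java:43"}
test_a = {"Foo.java:42", "Foo.java:10"}              # hits predicted-buggy code
test_b = {"Bar.java:7", "Bar.java:8", "Bar.java:9"}  # only predicted-clean code

ranked = sorted([("test_a", test_a), ("test_b", test_b)],
                key=lambda t: coverage_priority(t[1], predicted_buggy),
                reverse=True)
print([name for name, _ in ranked])  # test_a first, but test_b is still retained
```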

Citations: 0
Navigating the Complexity of Generative AI Adoption in Software Engineering
IF 4.4 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING | Pub Date: 2024-03-28 | DOI: 10.1145/3652154
Daniel Russo

This paper explores the adoption of Generative Artificial Intelligence (AI) tools within the domain of software engineering, focusing on the influencing factors at the individual, technological, and social levels. We applied a convergent mixed-methods approach to offer a comprehensive understanding of AI adoption dynamics. We initially conducted a questionnaire survey with 100 software engineers, drawing upon the Technology Acceptance Model (TAM), the Diffusion of Innovation Theory (DOI), and the Social Cognitive Theory (SCT) as guiding theoretical frameworks. Employing the Gioia Methodology, we derived a theoretical model of AI adoption in software engineering: the Human-AI Collaboration and Adaptation Framework (HACAF). This model was then validated using Partial Least Squares – Structural Equation Modeling (PLS-SEM) based on data from 183 software engineers. Findings indicate that at this early stage of AI integration, the compatibility of AI tools within existing development workflows predominantly drives their adoption, challenging conventional technology acceptance theories. The impact of perceived usefulness, social factors, and personal innovativeness seems less pronounced than expected. The study provides crucial insights for future AI tool design and offers a framework for developing effective organizational implementation strategies.

Citations: 0
The IDEA of Us: An Identity-Aware Architecture for Autonomous Systems
IF 4.4 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING | Pub Date: 2024-03-28 | DOI: 10.1145/3654439
Carlos Gavidia-Calderon, Anastasia Kordoni, Amel Bennaceur, Mark Levine, Bashar Nuseibeh

Autonomous systems, such as drones and rescue robots, are increasingly used during emergencies. They deliver services and provide situational awareness that facilitate emergency management and response. To do so, they need to interact and cooperate with humans in their environment. Human behaviour is uncertain and complex, so it can be difficult to reason about it formally. In this paper, we propose IDEA: an adaptive software architecture that enables cooperation between humans and autonomous systems by leveraging the social identity approach. This approach establishes that group membership drives human behaviour. Identity and group membership are crucial during emergencies, as they influence cooperation among survivors. IDEA systems infer the social identity of surrounding humans, thereby establishing their group membership. By reasoning about groups, we limit the number of cooperation strategies the system needs to explore. IDEA systems select a strategy from the equilibrium analysis of game-theoretic models that represent interactions between group members and the IDEA system. We demonstrate our approach using a search-and-rescue scenario, in which an IDEA rescue robot optimises evacuation by collaborating with survivors. Using an empirically validated agent-based model, we show that the deployment of the IDEA system can reduce median evacuation time by 13.6%.
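As a toy illustration of selecting a strategy from the equilibrium analysis of a game-theoretic model, the sketch below enumerates pure-strategy Nash equilibria of a 2x2 game between the robot and a survivor group; the strategy labels and payoff numbers are invented, not taken from the paper.

```python
import numpy as np

# Toy equilibrium analysis of a 2x2 game between an IDEA robot (rows) and a
# group of survivors (columns). Strategies and payoffs are invented.
robot_payoff = np.array([[3, 0],   # robot: "address group as in-group member"
                         [2, 1]])  # robot: "issue generic instructions"
group_payoff = np.array([[3, 1],   # group column 0: "follow robot"
                         [0, 1]])  # group column 1: "ignore robot"

def pure_nash_equilibria(A, B):
    """Enumerate pure-strategy Nash equilibria via mutual best responses."""
    eqs = []
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            row_best = A[i, j] >= A[:, j].max()   # robot cannot improve by switching rows
            col_best = B[i, j] >= B[i, :].max()   # group cannot improve by switching columns
            if row_best and col_best:
                eqs.append((i, j))
    return eqs

print(pure_nash_equilibria(robot_payoff, group_payoff))  # [(0, 0), (1, 1)] for these payoffs
```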

Citations: 0
MTL-TRANSFER: Leveraging Multi-task Learning and Transferred Knowledge for Improving Fault Localization and Program Repair
IF 4.4 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING | Pub Date: 2024-03-27 | DOI: 10.1145/3654441
Xu Wang, Hongwei Yu, Xiangxin Meng, Hongliang Cao, Hongyu Zhang, Hailong Sun, Xudong Liu, Chunming Hu

Fault localization (FL) and automated program repair (APR) are two main tasks of automatic software debugging. Compared with traditional methods, deep learning-based approaches have been demonstrated to achieve better performance in FL and APR tasks. However, existing deep learning-based FL methods ignore deep semantic features or only consider simple code representations. For APR tasks, existing template-based methods are weak at selecting the correct fix templates for effective program repair and are unable to synthesize patches via the end-to-end code-modification knowledge embedded in models trained on large-scale bug-fix code pairs. Moreover, in most FL and APR methods, the model design and training phases are performed separately, leading to ineffective sharing of updated parameters and extracted knowledge during the training process. This limitation hinders further improvement of FL and APR performance. To solve the above problems, we propose a novel approach called MTL-TRANSFER, which leverages a multi-task learning strategy to extract deep semantic features and transferred knowledge from different perspectives. First, we construct a large-scale open-source bug dataset and implement 11 multi-task learning models for bug detection and patch generation sub-tasks on 11 commonly used bug types, as well as one multi-classifier to learn the relevant semantics for the subsequent fix template selection task. Second, an MLP-based ranking model is leveraged to fuse spectrum-based, mutation-based and semantic-based features to generate a sorted list of suspicious statements. Third, we combine the patches generated by the neural patch generation sub-task from the multi-task learning strategy with the optimized fix-template selection order obtained from the multi-classifier mentioned above. Finally, the more accurate FL results, the optimized fix-template selection order, and the expanded patch candidates are combined to further enhance the overall performance of APR tasks. Our extensive experiments on the widely used benchmark Defects4J show that MTL-TRANSFER outperforms all baselines in FL and APR tasks, proving the effectiveness of our approach. Compared with our previously proposed FL method TRANSFER-FL (also the state-of-the-art statement-level FL method), MTL-TRANSFER increases the number of faults hit by 8/11/12 on the Top-1/3/5 metrics (92/159/183 in total). On APR tasks, the number of bugs successfully repaired by MTL-TRANSFER under the perfect localization setting reaches 75, which is 8 more than our previous APR method TRANSFER-PR. Furthermore, another experiment simulating actual repair scenarios shows that MTL-TRANSFER can successfully repair 15 and 9 more bugs (56 in total) than TBar and TRANSFER, respectively, which demonstrates the effectiveness of combining our optimized FL and APR components.
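The snippet below sketches just one ingredient of the ranking step: the standard Ochiai spectrum-based suspiciousness score, fused with placeholder mutation-based and semantic scores through a simple weighted sum. The weights and statement identifiers are assumptions; the paper instead trains an MLP-based ranking model over these feature families.

```python
import math

# One ingredient of the ranking step: the standard Ochiai spectrum-based
# suspiciousness score, which MTL-TRANSFER fuses with mutation-based and
# semantic features via an MLP. The weighted sum below is a placeholder,
# not the paper's trained model.

def ochiai(failed_cover, passed_cover, total_failed):
    """failed_cover/passed_cover: number of failing/passing tests covering the statement."""
    denom = math.sqrt(total_failed * (failed_cover + passed_cover))
    return failed_cover / denom if denom else 0.0

def fused_score(spectrum, mutation, semantic, w=(0.5, 0.2, 0.3)):
    return w[0] * spectrum + w[1] * mutation + w[2] * semantic

stmts = {
    "Foo.java:42": {"spectrum": ochiai(4, 1, 5), "mutation": 0.8, "semantic": 0.6},
    "Foo.java:77": {"spectrum": ochiai(1, 9, 5), "mutation": 0.1, "semantic": 0.2},
}
ranking = sorted(stmts, key=lambda s: fused_score(**stmts[s]), reverse=True)
print(ranking)  # most suspicious statement first
```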

Citations: 0
Early and Realistic Exploitability Prediction of Just-Disclosed Software Vulnerabilities: How Reliable Can It Be?
IF 4.4 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING | Pub Date: 2024-03-27 | DOI: 10.1145/3654443
Emanuele Iannone, Giulia Sellitto, Emanuele Iaccarino, Filomena Ferrucci, Andrea De Lucia, Fabio Palomba

With the rate of discovered and disclosed vulnerabilities escalating, researchers have been experimenting with machine learning to predict whether a vulnerability will be exploited. Existing solutions leverage information unavailable when a CVE is created, making them unsuitable just after the disclosure. This paper experiments with early exploitability prediction models driven exclusively by the initial CVE record, i.e., the original description and the linked online discussions. Leveraging NVD and Exploit Database, we evaluate 72 prediction models trained using six traditional machine learning classifiers, four feature representation schemas, and three data balancing algorithms. We also experiment with five pre-trained large language models (LLMs). The models leverage seven different corpora made by combining three data sources, i.e., CVE description, Security Focus, and BugTraq. The models are evaluated in a realistic, time-aware fashion by removing the training and test instances that cannot be labeled “neutral” with sufficient confidence. The validation reveals that CVE descriptions and Security Focus discussions are the best data to train on. Pre-trained LLMs do not show the expected performance, requiring further pre-training in the security domain. We distill new research directions, identify possible room for improvement, and envision automated systems assisting security experts in assessing the exploitability.
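A minimal stand-in for the kind of traditional configuration the study evaluates, assuming TF-IDF features over the initial CVE description and a logistic regression classifier from scikit-learn; the descriptions and labels are toy data, not the NVD/Exploit Database corpus, and the time-aware validation is omitted.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny stand-in for one of the evaluated configurations: TF-IDF features over
# the initial CVE description plus a classic classifier. The descriptions and
# labels below are invented toy data.
descriptions = [
    "buffer overflow in parser allows remote code execution",
    "improper input validation leads to denial of service",
    "sql injection in login form allows authentication bypass",
    "information disclosure via verbose error messages",
]
exploited = [1, 0, 1, 0]  # 1 = an exploit was later observed (toy labels)

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(descriptions, exploited)

new_cve = ["heap overflow in image decoder allows remote code execution"]
print(model.predict_proba(new_cve)[0][1])  # predicted exploitability probability
```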

Citations: 0
On the Way to SBOMs: Investigating Design Issues and Solutions in Practice
IF 4.4 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING | Pub Date: 2024-03-26 | DOI: 10.1145/3654442
Tingting Bi, Boming Xia, Zhenchang Xing, Qinghua Lu, Liming Zhu

The increase in software supply chain threats has underscored the necessity for robust security mechanisms, among which the Software Bill of Materials (SBOM) stands out as a promising solution. SBOMs, by providing a machine-readable inventory of software composition details, play a crucial role in enhancing transparency and traceability within software supply chains. This empirical study delves into the practical challenges and solutions associated with the adoption of SBOMs, through an analysis of 4,786 GitHub discussions across 510 SBOM-related projects. Through repository mining and analysis, this research delineates key topics, challenges, and solutions intrinsic to the effective utilization of SBOMs. Furthermore, we shed light on commonly used tools and frameworks for SBOM generation, exploring their respective strengths and limitations. The study underscores a set of findings: for example, the SBOM life cycle has four phases, each with its own set of SBOM development activities and issues. In addition, the study emphasizes the role SBOMs play in ensuring resilient software development practices and the imperative of their widespread adoption and integration to bolster supply chain security. The insights of our study provide vital input for future work and practical advancements in this topic.
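Since an SBOM is a machine-readable inventory, a short sketch of consuming one may help; the snippet below parses a trimmed, hypothetical CycloneDX-style document with the standard library. The field names follow the CycloneDX convention, but the component list is invented.

```python
import json

# A trimmed, hypothetical CycloneDX-style SBOM illustrating the machine-readable
# inventory discussed above; real SBOMs carry many more fields (licenses,
# hashes, dependency graphs, provenance).
sbom_json = """
{
  "bomFormat": "CycloneDX",
  "specVersion": "1.5",
  "components": [
    {"type": "library", "name": "log4j-core", "version": "2.17.1",
     "purl": "pkg:maven/org.apache.logging.log4j/log4j-core@2.17.1"},
    {"type": "library", "name": "requests", "version": "2.31.0",
     "purl": "pkg:pypi/requests@2.31.0"}
  ]
}
"""

sbom = json.loads(sbom_json)
for component in sbom["components"]:
    # e.g. feed each purl into a vulnerability-database lookup downstream
    print(f'{component["name"]}=={component["version"]} ({component["purl"]})')
```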

Citations: 0
On Estimating the Feasible Solution Space of Multi-Objective Testing Resource Allocation
IF 4.4 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING | Pub Date: 2024-03-26 | DOI: 10.1145/3654444
Guofu Zhang, Lei Li, Zhaopin Su, Feng Yue, Yang Chen, Miqing Li, Xin Yao

The multi-objective testing resource allocation problem (MOTRAP) is concerned with how to reasonably plan the testing time of software testers so as to save cost and improve reliability as much as possible. The feasible solution space of a MOTRAP is determined by its variables (i.e., the time invested in each component) and constraints (e.g., the pre-specified reliability, cost, or time). Although a variety of state-of-the-art constrained multi-objective optimisers can be used to find individual solutions in this space, their search remains inefficient and expensive because this space is tiny compared with the overall search space. The decision maker may often suffer a prolonged but unsuccessful search that fails to return a feasible solution. In this work, we first formulate a heavily constrained MOTRAP on the basis of an architecture-based model, in which reliability, cost, and time are optimised under pre-specified multiple constraints on reliability, cost, and time. Then, to estimate the feasible solution space of this specific MOTRAP, we develop theoretical and algorithmic approaches to deduce new, tighter lower and upper bounds on the variables from the constraints. Importantly, our approach can help the decision maker identify whether their constraint settings are practicable, and the derived bounds tightly enclose the tiny feasible solution space, helping off-the-shelf constrained multi-objective optimisers keep the search within the feasible solution space as much as possible. Additionally, to make further use of these bounds, we propose a generalised bound constraint handling method that can be readily employed by constrained multi-objective optimisers to pull infeasible solutions back into the estimated space with theoretical guarantees. Finally, we evaluate our approach on application and empirical cases. Experimental results reveal that our approach significantly enhances the efficiency, effectiveness, and robustness of off-the-shelf constrained multi-objective optimisers and state-of-the-art bound constraint handling methods at finding high-quality solutions for the decision maker. These improvements may help the decision maker take the stress out of setting constraints and selecting constrained multi-objective optimisers, and facilitate test planning more efficiently and effectively.
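As a minimal sketch of the bound constraint handling idea under stated assumptions, the snippet below pulls an infeasible candidate allocation back into a box defined by derived per-component bounds; the bound values and the candidate are placeholders, not bounds computed by the paper's method.

```python
import numpy as np

# Minimal sketch: once tighter per-component lower/upper bounds on testing time
# have been derived from the reliability/cost/time constraints, infeasible
# candidates produced by the optimiser are pulled back inside those bounds.
# The numbers below are placeholders.
lower = np.array([2.0, 1.0, 4.0])    # derived lower bounds per component (hours)
upper = np.array([10.0, 6.0, 12.0])  # derived upper bounds per component (hours)

def repair(candidate: np.ndarray) -> np.ndarray:
    """Project a candidate testing-time allocation into the estimated box."""
    return np.clip(candidate, lower, upper)

candidate = np.array([0.5, 7.5, 11.0])  # violates two of the bounds
print(repair(candidate))                # -> [ 2.  6. 11.]
```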

Citations: 0
DiPri: Distance-based Seed Prioritization for Greybox Fuzzing
IF 4.4 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING | Pub Date: 2024-03-26 | DOI: 10.1145/3654440
Ruixiang Qian, Quanjun Zhang, Chunrong Fang, Ding Yang, Shun Li, Binyu Li, Zhenyu Chen

Greybox fuzzing is a powerful testing technique. Given a set of initial seeds, greybox fuzzing continuously generates new test inputs to execute the program under test and drives executions with code coverage as feedback. Seed prioritization is an important step of greybox fuzzing that helps the fuzzer choose promising seeds for input generation. However, mainstream greybox fuzzers like AFL++ and Zest tend to neglect the importance of seed prioritization. They may pick seeds simply according to the order in which the seeds were queued, or according to an order produced by a random-based approach, which may consequently degrade their performance in exploring code and exposing bugs. Meanwhile, existing state-of-the-art techniques like Alphuzz and K-Scheduler adopt complex strategies to schedule seeds. Although powerful, such strategies inevitably incur great overhead and reduce the scalability of the technique.

In this paper, we propose a novel distance-based seed prioritization approach named DiPri to facilitate greybox fuzzing.

Specifically, DiPri evaluates the queued seeds according to seed distances and prioritises the outlier ones, i.e., those farthest from the others, to improve the probability of discovering previously unexplored code regions. To make a profound evaluation of DiPri, we prototype DiPri on AFL++ and conduct large-scale experiments with four baselines and 24 C/C++ fuzz targets, where eight are from widely adopted real-world projects, eight are from the coverage-based benchmark FuzzBench, and eight are from the bug-based benchmark Magma. The results, obtained through fuzzing exceeding 50,000 CPU hours, suggest that DiPri can (1) have only an insignificant influence on the host fuzzer's code-coverage capability, slightly improving branch coverage on the eight targets from real-world projects and slightly reducing it on the eight targets from FuzzBench, and (2) improve the host fuzzer's bug-finding capability by triggering five more Magma bugs. Besides the evaluation with the three C/C++ benchmarks, we integrate DiPri into the Java fuzzer Zest and conduct experiments on a Java benchmark composed of five real-world programs for more than 8,000 CPU hours to empirically study the scalability of DiPri. The results with the Java benchmark demonstrate that DiPri is quite scalable and can help the host fuzzer find bugs more consistently.
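A compact sketch of the distance-based idea, assuming seeds are already encoded as numeric feature vectors (e.g., coverage bitmaps); the encoding and the use of Euclidean distance are assumptions, not necessarily DiPri's exact design.

```python
import numpy as np

# Compact sketch of distance-based seed prioritization: pick the queued seed
# that is, on average, farthest from all others. How seeds are turned into
# feature vectors (coverage bitmaps, byte histograms, ...) is assumed here.
def pick_outlier_seed(seed_vectors: np.ndarray) -> int:
    diffs = seed_vectors[:, None, :] - seed_vectors[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)           # pairwise Euclidean distances
    avg_dist = dists.sum(axis=1) / (len(seed_vectors) - 1)
    return int(np.argmax(avg_dist))                  # index of the outlier seed

seeds = np.array([
    [1.0, 0.0, 0.0],   # seed 0
    [0.9, 0.1, 0.0],   # seed 1 (close to seed 0)
    [0.0, 0.0, 5.0],   # seed 2 (far from the rest -> chosen first)
])
print(pick_outlier_seed(seeds))  # 2
```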

Citations: 0
DinoDroid: Testing Android Apps Using Deep Q-Networks
IF 4.4 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING | Pub Date: 2024-03-14 | DOI: 10.1145/3652150
Yu Zhao, Brent Harrison, Tingting Yu

The large demand for mobile devices creates significant concerns about the quality of mobile applications (apps). Developers need to guarantee the quality of mobile apps before they are released to the market. There have been many approaches using different strategies to test the GUI of mobile apps. However, they still need improvement due to their limited effectiveness. In this paper, we propose DinoDroid, an approach based on deep Q-networks to automate testing of Android apps. DinoDroid learns a behavior model from a set of existing apps, and the learned model can be used to explore and generate tests for new apps. DinoDroid is able to capture the fine-grained details of GUI events (e.g., the content of GUI widgets) and use them as features that are fed into a deep neural network, which acts as the agent to guide app exploration. DinoDroid automatically adapts the learned model during the exploration without the need for any modeling strategies or pre-defined rules. We conduct experiments on 64 open-source Android apps. The results show that DinoDroid outperforms existing Android testing tools in terms of code coverage and bug detection.
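A skeletal sketch of the deep Q-network idea in PyTorch: a small value network scores candidate GUI events and an epsilon-greedy policy picks the next event to exercise. The feature size, event encoding, network shape, and reward signal are placeholders, not DinoDroid's learned model.

```python
import random
import torch
import torch.nn as nn

# Skeletal DQN-style sketch: a value network scores candidate GUI events from
# their features, and an epsilon-greedy policy picks the next event. All sizes
# and encodings are assumptions for illustration.
EVENT_FEATURES = 8  # e.g. widget type, text length, visit count, ... (assumed)

q_net = nn.Sequential(
    nn.Linear(EVENT_FEATURES, 32),
    nn.ReLU(),
    nn.Linear(32, 1),  # Q-value for "execute this event now"
)

def choose_event(event_features: torch.Tensor, epsilon: float = 0.1) -> int:
    """event_features: (num_events, EVENT_FEATURES) tensor for the current screen."""
    if random.random() < epsilon:
        return random.randrange(event_features.shape[0])   # explore
    with torch.no_grad():
        q_values = q_net(event_features).squeeze(-1)        # (num_events,)
    return int(torch.argmax(q_values))                      # exploit

screen_events = torch.rand(5, EVENT_FEATURES)  # 5 candidate GUI events (fake features)
print(choose_event(screen_events))
```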

Citations: 0