Due to the rapid adoption of Deep Neural Networks (DNNs) into larger software systems, testing of DNN-based systems has received much attention recently. While many different test adequacy criteria have been suggested, we lack effective test input generation techniques. Inputs such as images of real-world objects and scenes are not only expensive to collect but also difficult to sample randomly. Consequently, current testing techniques for DNNs tend to apply small local perturbations to existing inputs to generate new inputs. We propose SINVAD, a way to sample from, and navigate over, a space of realistic inputs that resembles the true distribution in the training data. Our input space is constructed using Variational AutoEncoders (VAEs), and navigated through their latent vector space. Our analysis shows that the VAE-based input space is well-aligned with human perception of what constitutes realistic inputs. Further, we show that this space can be effectively searched to achieve various testing scenarios, such as boundary testing of two different DNNs or analyzing class labels that are difficult for the given DNN to distinguish. Guidelines on how to design VAE architectures are presented as well. Our results have the potential to open the field to meaningful exploration through the space of highly structured images.
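Concretely, the latent-space search the abstract describes could be approximated as in the minimal sketch below, assuming a pre-trained VAE exposing encode()/decode() and a classifier under test; the names vae, classifier, and seed_image are placeholders rather than SINVAD's actual API.

```python
# A minimal sketch of latent-space search in the spirit of SINVAD.
# `vae`, `classifier`, and `seed_image` are assumed placeholder objects.
import torch

def latent_search(vae, classifier, seed_image, steps=500, sigma=0.1):
    """Random hill-climbing in the VAE latent space: look for a decoded image
    that the classifier labels differently from the seed while staying close
    to the seed's latent code (i.e., remaining 'realistic')."""
    with torch.no_grad():
        mu, _ = vae.encode(seed_image.unsqueeze(0))        # seed's latent mean
        seed_label = classifier(seed_image.unsqueeze(0)).argmax(dim=1)
        best_z, best_dist = None, float("inf")
        for _ in range(steps):
            z = mu + sigma * torch.randn_like(mu)          # perturb in latent space
            candidate = vae.decode(z)
            pred = classifier(candidate).argmax(dim=1)
            dist = torch.norm(z - mu).item()
            # keep the misclassifying candidate closest to the seed's code
            if pred.item() != seed_label.item() and dist < best_dist:
                best_z, best_dist = z, dist
        return vae.decode(best_z) if best_z is not None else None
```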
{"title":"Deceiving Humans and Machines Alike: Search-based Test Input Generation for DNNs using Variational Autoencoders","authors":"Sungmin Kang, Robert Feldt, Shin Yoo","doi":"10.1145/3635706","DOIUrl":"https://doi.org/10.1145/3635706","url":null,"abstract":"<p>Due to the rapid adoption of Deep Neural Networks (DNNs) into larger software systems, testing of DNN based systems has received much attention recently. While many different test adequacy criteria have been suggested, we lack effective test input generation techniques. Inputs such as images of real world objects and scenes are not only expensive to collect but also difficult to randomly sample. Consequently, current testing techniques for DNNs tend to apply small local perturbations to existing inputs to generate new inputs. We propose SINVAD, a way to sample from, and navigate over, a space of realistic inputs that resembles the true distribution in the training data. Our input space is constructed using Variational AutoEncoders (VAEs), and navigated through their latent vector space. Our analysis shows that the VAE-based input space is well-aligned with human perception of what constitutes realistic inputs. Further, we show that this space can be effectively searched to achieve various testing scenarios, such as boundary testing of two different DNNs or analyzing class labels that are difficult for the given DNN to distinguish. Guidelines on how to design VAE architectures are presented as well. Our results have the potential to open the field to meaningful exploration through the space of highly structured images.</p>","PeriodicalId":50933,"journal":{"name":"ACM Transactions on Software Engineering and Methodology","volume":"1 1","pages":""},"PeriodicalIF":4.4,"publicationDate":"2023-12-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138823838","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Baharin A. Jodat, Abhishek Chandar, Shiva Nejati, Mehrdad Sabetzadeh
Test inputs fail not only when the system under test is faulty but also when the inputs are invalid or unrealistic. Failures resulting from invalid or unrealistic test inputs are spurious. Avoiding spurious failures improves the effectiveness of testing in exercising the main functions of a system, particularly for compute-intensive (CI) systems where a single test execution takes significant time. In this paper, we propose to build failure models for inferring interpretable rules on test inputs that cause spurious failures. We examine two alternative strategies for building failure models: (1) machine learning (ML)-guided test generation and (2) surrogate-assisted test generation. ML-guided test generation infers boundary regions that separate passing and failing test inputs and samples test inputs from those regions. Surrogate-assisted test generation relies on surrogate models to predict labels for test inputs instead of exercising all the inputs. We propose a novel surrogate-assisted algorithm that uses multiple surrogate models simultaneously, and dynamically selects the prediction from the most accurate model. We empirically evaluate the accuracy of failure models inferred based on surrogate-assisted and ML-guided test generation algorithms. Using case studies from the domains of cyber-physical systems and networks, we show that our proposed surrogate-assisted approach generates failure models with an average accuracy of 83%, significantly outperforming ML-guided test generation and two baselines. Further, our approach learns failure-inducing rules that identify genuine spurious failures as validated against domain knowledge.
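The surrogate-assisted strategy can be illustrated with a small sketch: several surrogate models are trained on the test inputs labeled so far, the most accurate one (by cross-validation) labels new candidates, and the real, compute-intensive test is only run when the surrogate is not confident. The function run_real_test, the model choices, and the 0.9 confidence threshold are assumptions for illustration, not the paper's algorithm.

```python
# A minimal sketch of the multi-surrogate idea, under the assumptions above.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def label_with_surrogates(X_labeled, y_labeled, X_candidates, run_real_test):
    surrogates = [RandomForestClassifier(), KNeighborsClassifier(),
                  LogisticRegression(max_iter=1000)]
    # Rank surrogates by cross-validated accuracy on the data labeled so far.
    scores = [cross_val_score(m, X_labeled, y_labeled, cv=3).mean()
              for m in surrogates]
    best = surrogates[int(np.argmax(scores))]
    best.fit(X_labeled, y_labeled)

    labels = []
    for x in X_candidates:
        proba = best.predict_proba([x])[0]
        if proba.max() >= 0.9:                         # confident: trust the surrogate
            labels.append(best.classes_[int(np.argmax(proba))])
        else:                                          # uncertain: run the real test
            labels.append(run_real_test(x))
    return labels
```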
{"title":"Test Generation Strategies for Building Failure Models and Explaining Spurious Failures","authors":"Baharin A. Jodat, Abhishek Chandar, Shiva Nejati, Mehrdad Sabetzadeh","doi":"10.1145/3638246","DOIUrl":"https://doi.org/10.1145/3638246","url":null,"abstract":"<p>Test inputs fail not only when the system under test is faulty but also when the inputs are invalid or unrealistic. Failures resulting from invalid or unrealistic test inputs are spurious. Avoiding spurious failures improves the effectiveness of testing in exercising the main functions of a system, particularly for compute-intensive (CI) systems where a single test execution takes significant time. In this paper, we propose to build failure models for inferring interpretable rules on test inputs that cause spurious failures. We examine two alternative strategies for building failure models: (1) machine learning (ML)-guided test generation and (2) surrogate-assisted test generation. <i>ML-guided test generation</i> infers boundary regions that separate passing and failing test inputs and samples test inputs from those regions. <i>Surrogate-assisted test generation</i> relies on surrogate models to predict labels for test inputs instead of exercising all the inputs. We propose a novel surrogate-assisted algorithm that uses multiple surrogate models simultaneously, and dynamically selects the prediction from the most accurate model. We empirically evaluate the accuracy of failure models inferred based on surrogate-assisted and ML-guided test generation algorithms. Using case studies from the domains of cyber-physical systems and networks, we show that our proposed surrogate-assisted approach generates failure models with an average accuracy of 83%, significantly outperforming ML-guided test generation and two baselines. Further, our approach learns failure-inducing rules that identify genuine spurious failures as validated against domain knowledge.</p>","PeriodicalId":50933,"journal":{"name":"ACM Transactions on Software Engineering and Methodology","volume":"81 1","pages":""},"PeriodicalIF":4.4,"publicationDate":"2023-12-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139028039","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Han Wang, Sijia Yu, Chunyang Chen, Burak Turhan, Xiaodong Zhu
Deep Learning (DL) models have rapidly advanced, focusing on achieving high performance through testing model accuracy and robustness. However, it is unclear whether DL projects, as software systems, are thoroughly tested or functionally correct when they need to be treated and tested like other software systems. Therefore, we empirically study the unit tests in open-source DL projects, analyzing 9,129 projects from GitHub. We find that: 1) unit-tested DL projects show a positive correlation with open-source project metrics and have a higher acceptance rate of pull requests, 2) 68% of the sampled DL projects are not unit tested at all, 3) the layer and utility (utils) modules of DL models have the most unit tests. Based on these findings and previous research outcomes, we built a mapping taxonomy between unit tests and faults in DL projects. We discuss the implications of our findings for developers and researchers and highlight the need for unit testing in open-source DL projects to ensure their reliability and stability. The study contributes to this community by raising awareness of the importance of unit testing in DL projects and encouraging further research in this area.
{"title":"Beyond Accuracy: An Empirical Study on Unit Testing in Open-source Deep Learning Projects","authors":"Han Wang, Sijia Yu, Chunyang Chen, Burak Turhan, Xiaodong Zhu","doi":"10.1145/3638245","DOIUrl":"https://doi.org/10.1145/3638245","url":null,"abstract":"<p>Deep Learning (DL) models have rapidly advanced, focusing on achieving high performance through testing model accuracy and robustness. However, it is unclear whether DL projects, as software systems, are tested thoroughly or functionally correct when there is a need to treat and test them like other software systems. Therefore, we empirically study the unit tests in open-source DL projects, analyzing 9,129 projects from GitHub. We find that: 1) unit tested DL projects have positive correlation with the open-source project metrics and have a higher acceptance rate of pull requests, 2) 68% of the sampled DL projects are not unit tested at all, 3) the layer and utilities (utils) of DL models have the most unit tests. Based on these findings and previous research outcomes, we built a mapping taxonomy between unit tests and faults in DL projects. We discuss the implications of our findings for developers and researchers and highlight the need for unit testing in open-source DL projects to ensure their reliability and stability. The study contributes to this community by raising awareness of the importance of unit testing in DL projects and encouraging further research in this area.</p>","PeriodicalId":50933,"journal":{"name":"ACM Transactions on Software Engineering and Methodology","volume":"1 1","pages":""},"PeriodicalIF":4.4,"publicationDate":"2023-12-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138819666","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Software development teams establish elaborate continuous integration pipelines containing automated test cases to accelerate the development process of software. Automated tests help to verify the correctness of code modifications, decreasing the response time to changing requirements. However, when software teams do not track the performance impact of pending modifications, they may need to spend considerable time refactoring existing code. This paper presents PACE, a program analysis framework that provides continuous feedback on the performance impact of pending code updates. We design performance microbenchmarks by mapping the execution time of functional test cases given a code update. We map microbenchmarks to code stylometry features and feed them to predictors for performance predictions. In our experiments, PACE predicts code performance accurately, outperforming the current state of the art by 75% on neural-represented code stylometry features.
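The prediction step described above can be sketched as a simple regression from code-stylometry feature vectors to measured test execution times. Feature extraction is out of scope here, and the model choice is an assumption rather than PACE's actual design.

```python
# A rough sketch of the prediction step, under the assumptions above.
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_percentage_error

def train_performance_predictor(stylometry_features, test_exec_times):
    """stylometry_features: one feature vector per code update;
    test_exec_times: measured execution time of the functional tests."""
    X_train, X_test, y_train, y_test = train_test_split(
        stylometry_features, test_exec_times, test_size=0.2, random_state=0)
    model = GradientBoostingRegressor().fit(X_train, y_train)
    mape = mean_absolute_percentage_error(y_test, model.predict(X_test))
    return model, mape   # model predicts execution time for a pending update
```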
{"title":"PACE: A Program Analysis Framework for Continuous Performance Prediction","authors":"Chidera Biringa, Gökhan Kul","doi":"10.1145/3637230","DOIUrl":"https://doi.org/10.1145/3637230","url":null,"abstract":"<p>Software development teams establish elaborate continuous integration pipelines containing automated test cases to accelerate the development process of software. Automated tests help to verify the correctness of code modifications decreasing the response time to changing requirements. However, when the software teams do not track the performance impact of pending modifications, they may need to spend considerable time refactoring existing code. This paper presents <monospace>PACE</monospace>, a program analysis framework that provides continuous feedback on the performance impact of pending code updates. We design performance microbenchmarks by mapping the execution time of functional test cases given a code update. We map microbenchmarks to code stylometry features and feed them to predictors for performance predictions. Our experiments achieved significant performance in predicting code performance, outperforming current state-of-the-art by 75% on neural-represented code stylometry features.</p>","PeriodicalId":50933,"journal":{"name":"ACM Transactions on Software Engineering and Methodology","volume":"2 1","pages":""},"PeriodicalIF":4.4,"publicationDate":"2023-12-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138691862","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pablo C. Cañizares, Jose María López-Morales, Sara Pérez-Soler, Esther Guerra, Juan de Lara
Conversational agents, or chatbots, have become a popular way to access all kinds of software services. They provide an intuitive natural language interface for interaction, available from a wide range of channels including social networks, web pages, intelligent speakers or cars. In response to this demand, many chatbot development platforms and tools have emerged. However, they typically lack support to statically measure properties of the chatbots being built, as indicators of their size, complexity, quality or usability. Similarly, there are hardly any mechanisms to compare and cluster chatbots developed with heterogeneous technologies.
To overcome this limitation, we propose a suite of 21 metrics for chatbot designs, as well as two clustering methods that help in grouping chatbots along their conversation topics and design features. Both the metrics and the clustering methods are defined on a neutral chatbot design language, becoming independent of the implementation platform. We provide automatic translations of chatbots defined on some major platforms into this neutral notation to perform the measurement and clustering. The approach is supported by our tool Asymob, which we have used to evaluate the metrics and the clustering methods over a set of 259 Dialogflow and Rasa chatbots from open-source repositories. The results open the door to incorporating the metrics within chatbot development processes for the early detection of quality issues, and to exploiting clustering to organise large collections of chatbots into meaningful groups that ease chatbot comprehension, search and comparison.
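As a small illustration of the clustering side of this approach, chatbots can be represented as vectors of design metrics, standardised, and grouped with an off-the-shelf clustering algorithm. The metric values below are invented examples; the paper's 21 metrics and Asymob's actual pipeline are not reproduced.

```python
# Clustering chatbots by (hypothetical) design metrics.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# rows: chatbots; columns: e.g. #intents, #entities, #training phrases, #flows
metric_vectors = np.array([
    [12,  4,  35,  3],
    [48, 17, 210,  9],
    [10,  3,  28,  2],
    [55, 20, 250, 11],
])

X = StandardScaler().fit_transform(metric_vectors)       # put metrics on one scale
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # chatbots with similar design features share a cluster id
```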
{"title":"Measuring and Clustering Heterogeneous Chatbot Designs","authors":"Pablo C. Cañizares, Jose María López-Morales, Sara Pérez-Soler, Esther Guerra, Juan de Lara","doi":"10.1145/3637228","DOIUrl":"https://doi.org/10.1145/3637228","url":null,"abstract":"<p>Conversational agents, or chatbots, have become popular to access all kind of software services. They provide an intuitive natural language interface for interaction, available from a wide range of channels including social networks, web pages, intelligent speakers or cars. In response to this demand, many chatbot development platforms and tools have emerged. However, they typically lack support to statically measure properties of the chatbots being built, as indicators of their size, complexity, quality or usability. Similarly, there are hardly any mechanisms to compare and cluster chatbots developed with heterogeneous technologies. </p><p>To overcome this limitation, we propose a suite of 21 metrics for chatbot designs, as well as two clustering methods that help in grouping chatbots along their conversation topics and design features. Both the metrics and the clustering methods are defined on a neutral chatbot design language, becoming independent of the implementation platform. We provide automatic translations of chatbots defined on some major platforms into this neutral notation to perform the measurement and clustering. The approach is supported by our tool <span>Asymob</span>, which we have used to evaluate the metrics and the clustering methods over a set of 259 Dialogflow and Rasa chatbots from open-source repositories. The results open the door to incorporating the metrics within chatbot development processes for the early detection of quality issues, and to exploit clustering to organise large collections of chatbots into significant groups to ease chatbot comprehension, search and comparison.</p>","PeriodicalId":50933,"journal":{"name":"ACM Transactions on Software Engineering and Methodology","volume":"44 1","pages":""},"PeriodicalIF":4.4,"publicationDate":"2023-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138628466","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A smart contract is a kind of code deployed on the blockchain that executes automatically once an event triggers a clause in the contract. Since smart contracts involve businesses such as asset transfer, they are more vulnerable to attacks, so it is crucial to ensure the security of smart contracts. Because a smart contract cannot be tampered with once deployed on the blockchain, it is necessary for smart contract developers to fix vulnerabilities before deployment. Compared with the many vulnerability detection tools for smart contracts, the number of automatic repair approaches for smart contracts is relatively limited. These approaches mainly use defined pattern-based methods or heuristic search algorithms for vulnerability repairs. In this paper, we propose RLRep, a reinforcement learning-based approach to provide smart contract repair recommendations for smart contract developers automatically. This approach adopts an agent to provide repair action suggestions based on the vulnerable smart contract without any supervision, which can solve the problem of missing labeled data in machine learning-based repair methods. We evaluate our approach on a dataset containing 853 smart contract programs (programming language: Solidity) with different kinds of vulnerabilities. We split them into a training and a test set. The result shows that our approach can provide 54.97% correct repair recommendations for smart contracts.
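A highly simplified, tabular Q-learning sketch conveys the idea of an agent that learns which repair action to suggest for which vulnerability, rewarded when a detector no longer flags the patched contract. The environment functions (apply_action, detector) are placeholders; RLRep's actual state and action encoding is not shown.

```python
# A loose, tabular Q-learning illustration of learning repair suggestions.
import random
from collections import defaultdict

def train_repair_agent(contracts, actions, apply_action, detector,
                       episodes=1000, alpha=0.1, gamma=0.9, eps=0.2):
    Q = defaultdict(float)                           # Q[(state, action)] -> value
    for _ in range(episodes):
        contract = random.choice(contracts)
        state = detector(contract)                   # e.g. detected vulnerability type
        if random.random() < eps:
            action = random.choice(actions)          # explore
        else:
            action = max(actions, key=lambda a: Q[(state, a)])  # exploit
        patched = apply_action(contract, action)
        new_state = detector(patched)                # None = no vulnerability left
        reward = 1.0 if new_state is None else -0.1
        future = 0.0 if new_state is None else max(Q[(new_state, a)] for a in actions)
        Q[(state, action)] += alpha * (reward + gamma * future - Q[(state, action)])
    return Q   # at suggestion time: recommend argmax_a Q[(vulnerability, a)]
```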
{"title":"Smart Contract Code Repair Recommendation based on Reinforcement Learning and Multi-metric Optimization","authors":"Hanyang Guo, Yingye Chen, Xiangping Chen, Yuan Huang, Zibin Zheng","doi":"10.1145/3637229","DOIUrl":"https://doi.org/10.1145/3637229","url":null,"abstract":"<p>A smart contract is a kind of code deployed on the blockchain that executes automatically once an event triggers a clause in the contract. Since smart contracts involve businesses such as asset transfer, they are more vulnerable to attacks, so it is crucial to ensure the security of smart contracts. Because a smart contract cannot be tampered with once deployed on the blockchain, for smart contract developers, it is necessary to fix vulnerabilities before deployment. Compared with many vulnerability detection tools for smart contracts, the amount of automatic fix approaches for smart contracts is relatively limited. These approaches mainly use defined pattern-based methods or heuristic search algorithms for vulnerability repairs. In this paper, we propose <i>RLRep</i>, a reinforcement learning-based approach to provide smart contract repair recommendations for smart contract developers automatically. This approach adopts an agent to provide repair action suggestions based on the vulnerable smart contract without any supervision, which can solve the problem of missing labeled data in machine learning-based repair methods. We evaluate our approach on a dataset containing 853 smart contract programs (programming language: Solidity) with different kinds of vulnerabilities. We split them into training and test set. The result shows that our approach can provide 54.97% correct repair recommendations for smart contracts.</p>","PeriodicalId":50933,"journal":{"name":"ACM Transactions on Software Engineering and Methodology","volume":"15 1","pages":""},"PeriodicalIF":4.4,"publicationDate":"2023-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138574724","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shikai Guo, Dongmin Li, Lin Huang, Sijia Lv, Rong Chen, Hui Li, Xiaochen Li, He Jiang
The aim of Just-In-Time (JIT) defect prediction is to predict software changes that are prone to defects in a project in a timely manner, thereby improving the efficiency of software development and ensuring software quality. Identifying changes that introduce bugs is a critical task in just-in-time defect prediction, and researchers have introduced the SZZ approach and its variants to label these changes. However, it has been shown that different SZZ algorithms introduce noise to the dataset to a certain extent, which may reduce the predictive performance of the model. To address this limitation, we propose the Confident Learning Imbalance (CLI) model. The model identifies and excludes samples whose labels may be corrupted by estimating the joint distribution of noisy labels and true labels, and mitigates the impact of noisy data on the performance of the prediction model. The CLI consists of two components: a Confident Learning (CL) component that identifies noisy data and an Imbalanced Data Probabilistic Prediction (IDPP) component that generates a predicted probability matrix for imbalanced data. The IDPP component generates precise predicted probabilities for each instance in the training set, while the CL component uses the generated predicted probability matrix and noise labels to clean up the noise and build a classification model. We evaluate the performance of our model through extensive experiments on a total of 126,526 changes from ten Apache open source projects, and the results show that our model outperforms the baseline methods.
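The confident-learning step can be sketched as follows: given out-of-sample predicted probabilities and the (possibly noisy) SZZ labels, flag the changes whose labels are likely corrupted. This follows the generic confident-learning recipe and is not necessarily the exact CLI implementation.

```python
# A compact sketch of flagging likely-mislabeled changes via confident learning.
import numpy as np

def find_suspect_labels(pred_probs, noisy_labels):
    """pred_probs: (n_samples, n_classes) out-of-sample predicted probabilities;
    noisy_labels: (n_samples,) integer labels produced by an SZZ variant."""
    n_classes = pred_probs.shape[1]
    # per-class threshold: average self-confidence of samples given that label
    thresholds = np.array([pred_probs[noisy_labels == c, c].mean()
                           for c in range(n_classes)])
    suspects = []
    for i, (p, y) in enumerate(zip(pred_probs, noisy_labels)):
        # classes the model confidently asserts for this sample
        confident = [c for c in range(n_classes) if p[c] >= thresholds[c]]
        if confident and y not in confident:
            suspects.append(i)       # label disagrees with a confident prediction
    return suspects                  # candidates to prune before training
```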
{"title":"Estimating Uncertainty in Labeled Changes by SZZ Tools on Just-In-Time Defect Prediction","authors":"Shikai Guo, Dongmin Li, Lin Huang, Sijia Lv, Rong Chen, Hui Li, Xiaochen Li, He Jiang","doi":"10.1145/3637226","DOIUrl":"https://doi.org/10.1145/3637226","url":null,"abstract":"<p>The aim of Just-In-Time (JIT) defect prediction is to predict software changes that are prone to defects in a project in a timely manner, thereby improving the efficiency of software development and ensuring software quality. Identifying changes that introduce bugs is a critical task in just-in-time defect prediction, and researchers have introduced the SZZ approach and its variants to label these changes. However, it has been shown that different SZZ algorithms introduce noise to the dataset to a certain extent, which may reduce the predictive performance of the model. To address this limitation, we propose the Confident Learning Imbalance (CLI) model. The model identifies and excludes samples whose labels may be corrupted by estimating the joint distribution of noisy labels and true labels, and mitigates the impact of noisy data on the performance of the prediction model. The CLI consists of two components: identifying noisy data (Confident Learning Component) and generating a predicted probability matrix for imbalanced data (Imbalanced Data Probabilistic Prediction Component). The IDPP component generates precise predicted probabilities for each instance in the training set, while the CL component uses the generated predicted probability matrix and noise labels to clean up the noise and build a classification model. We evaluate the performance of our model through extensive experiments on a total of 126,526 changes from ten Apache open source projects, and the results show that our model outperforms the baseline methods.</p>","PeriodicalId":50933,"journal":{"name":"ACM Transactions on Software Engineering and Methodology","volume":"25 1","pages":""},"PeriodicalIF":4.4,"publicationDate":"2023-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138577117","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
C is a dominant programming language for implementing system and low-level embedded software. Unfortunately, the unsafe nature of its low-level control of memory often leads to memory errors. Dynamic analysis has been widely used to detect memory errors at runtime. However, existing monitoring algorithms for dynamic analysis are not yet satisfactory as they cannot deterministically and completely detect some types of errors, e.g., segment confusion errors, sub-object overflows, use-after-frees and memory leaks.
We propose a new monitoring algorithm, namely Smatus, short for smart status, that improves memory safety by performing comprehensive dynamic analysis. The key innovation is to maintain at runtime a small status node for each memory object. A status node records the status value and reference count of an object, where the status value denotes the liveness and segment type of this object, and the reference count tracks the number of pointer variables pointing to this object. Smatus maintains at runtime a pointer metadata for each pointer variable, to record not only the base and bound of a pointer’s referent but also the address of the referent’s status node. All the pointers pointing to the same referent share the same status node in their pointer metadata. A status node is smart in the sense that it is automatically deleted when it becomes useless (indicated by its reference count reaching zero). To the best of our knowledge, Smatus represents the most comprehensive approach of its kind.
We have evaluated Smatus using a large set of programs including the NIST Software Assurance Reference Dataset, MSBench, MiBench, SPEC and stress-testing benchmarks. In terms of effectiveness (detecting different types of memory errors), Smatus outperforms the state-of-the-art tools Google's AddressSanitizer, SoftBoundCETS and Valgrind, as it detects more errors. In terms of performance, Smatus incurs lower time and memory overheads than SoftBoundCETS and Valgrind, and is on par with AddressSanitizer in the time and memory overhead tradeoff made, while incurring much lower memory overheads.
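For intuition, the bookkeeping described above can be illustrated in a few lines of Python (the actual tool instruments C programs): each object gets a status node holding a status value and a reference count, and each pointer carries base, bound, and a link to its referent's status node.

```python
# A conceptual, language-shifted illustration of status-node bookkeeping;
# not the tool's implementation, which operates on instrumented C code.
class StatusNode:
    def __init__(self, status):
        self.status = status        # liveness + segment type, e.g. "heap-live"
        self.refcount = 0

class PointerMeta:
    def __init__(self, base, bound, node):
        self.base, self.bound = base, bound
        self.node = node
        node.refcount += 1          # one more pointer refers to this object

    def release(self):
        self.node.refcount -= 1
        if self.node.refcount == 0: # no pointer left: the node is useless
            self.node = None        # "smart" deletion of the status node

def check_deref(ptr_meta, addr):
    if ptr_meta.node is None or ptr_meta.node.status == "freed":
        raise RuntimeError("use-after-free detected")
    if not (ptr_meta.base <= addr < ptr_meta.bound):
        raise RuntimeError("out-of-bounds access detected")
```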
{"title":"A Smart Status Based Monitoring Algorithm for the Dynamic Analysis of Memory Safety","authors":"Zhe Chen, Rui Yan, Yingzi Ma, Yulei Sui, Jingling Xue","doi":"10.1145/3637227","DOIUrl":"https://doi.org/10.1145/3637227","url":null,"abstract":"<p>C is a dominant programming language for implementing system and low-level embedded software. Unfortunately, the unsafe nature of its low-level control of memory often leads to memory errors. Dynamic analysis has been widely used to detect memory errors at runtime. However, existing monitoring algorithms for dynamic analysis are not yet satisfactory as they cannot deterministically and completely detect some types of errors, e.g., segment confusion errors, sub-object overflows, use-after-frees and memory leaks. </p><p>We propose a new monitoring algorithm, namely <span>Smatus</span>, short for <i>smart status</i>, that improves memory safety by performing comprehensive dynamic analysis. The key innovation is to maintain at runtime a small <i>status node</i> for each memory object. A status node records the <i>status value</i> and <i>reference count</i> of an object, where the status value denotes the liveness and segment type of this object, and the reference count tracks the number of pointer variables pointing to this object. <span>Smatus</span> maintains at runtime a pointer metadata for each pointer variable, to record not only the base and bound of a pointer’s referent but also the address of the referent’s status node. All the pointers pointing to the same referent share the same status node in their pointer metadata. A status node is <i>smart</i> in the sense that it is automatically deleted when it becomes useless (indicated by its reference count reaching zero). To the best of our knowledge, <span>Smatus</span> represents the most comprehensive approach of its kind. </p><p>We have evaluated <span>Smatus</span> by using a large set of programs including the NIST Software Assurance Reference Dataset, MSBench, MiBench, SPEC and stress testing benchmarks. In terms of effectiveness (detecting different types of memory errors), <span>Smatus</span> outperforms state-of-the-art tools, Google’s AddressSanitizer, SoftBoundCETS and Valgrind, as it is capable of detecting more errors. In terms of performance (the time and memory overheads), <span>Smatus</span> outperforms SoftBoundCETS and Valgrind in terms of both lower time and memory overheads incurred, and is on par with AddressSanitizer in terms of the time and memory overhead tradeoff made (with much lower memory overheads incurred).</p>","PeriodicalId":50933,"journal":{"name":"ACM Transactions on Software Engineering and Methodology","volume":"13 1","pages":""},"PeriodicalIF":4.4,"publicationDate":"2023-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138569386","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The field of software verification has produced a wide array of algorithmic techniques that can prove a variety of properties of a given program. It has been demonstrated that the performance of these techniques can vary up to 4 orders of magnitude on the same verification problem. Even for verification experts, it is difficult to decide which tool will perform best on a given problem. For general users, deciding the best tool for their verification problem is effectively impossible.
In this work, we present Graves, a selection strategy based on graph neural networks (GNNs). Graves generates a graph representation of a program from which a GNN predicts a score for a verifier that indicates its performance on the program.
We evaluate Graves on a set of 10 verification tools and over 8000 verification problems and find that it improves the state-of-the-art in verification algorithm selection by 12%, or 8 percentage points. Further, it is able to verify 9% more problems than any existing verifier on our test set. Through a qualitative study on model interpretability, we find strong evidence that the Graves model learns to base its predictions on factors that relate to the unique features of the algorithmic techniques.
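A bare-bones sketch of such a scoring model: a small graph neural network reads a program graph (node features plus adjacency) and emits one score per candidate verifier. The layer sizes and the plain-PyTorch message passing are assumptions for illustration, not Graves' architecture.

```python
# A toy GNN that scores verifiers from a program graph, under the assumptions above.
import torch
import torch.nn as nn

class VerifierScorer(nn.Module):
    def __init__(self, feat_dim, hidden=64, n_verifiers=10):
        super().__init__()
        self.w1 = nn.Linear(feat_dim, hidden)
        self.w2 = nn.Linear(hidden, hidden)
        self.head = nn.Linear(hidden, n_verifiers)

    def forward(self, x, adj):
        # x: (n, feat_dim) node features; adj: (n, n) adjacency with self-loops
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        norm_adj = adj / deg                        # mean-aggregation of neighbors
        h = torch.relu(self.w1(norm_adj @ x))       # message passing, round 1
        h = torch.relu(self.w2(norm_adj @ h))       # message passing, round 2
        graph_repr = h.mean(dim=0)                  # pool nodes into a program embedding
        return self.head(graph_repr)                # one predicted score per verifier

# usage: pick the verifier with the highest predicted score
# scores = VerifierScorer(feat_dim=32)(node_feats, adjacency)
# best_verifier = scores.argmax().item()
```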
{"title":"Algorithm Selection for Software Verification using Graph Neural Networks","authors":"Will Leeson, Matthew B. Dwyer","doi":"10.1145/3637225","DOIUrl":"https://doi.org/10.1145/3637225","url":null,"abstract":"<p>The field of software verification has produced a wide array of algorithmic techniques that can prove a variety of properties of a given program. It has been demonstrated that the performance of these techniques can vary up to 4 orders of magnitude on the same verification problem. Even for verification experts, it is difficult to decide which tool will perform best on a given problem. For general users, deciding the best tool for their verification problem is effectively impossible. </p><p>In this work, we present <span>Graves</span>, a selection strategy based on graph neural networks (GNNs). <span>Graves</span> generates a graph representation of a program from which a GNN predicts a score for a verifier that indicates its performance on the program. </p><p>We evaluate <span>Graves</span> on a set of 10 verification tools and over 8000 verification problems and find that it improves the state-of-the-art in verification algorithm selection by 12%, or 8 percentage points. Further, it is able to verify 9% more problems than any existing verifier on our test set. Through a qualitative study on model interpretability, we find strong evidence that the <span>Graves</span>’ model learns to base its predictions on factors that relate to the unique features of the algorithmic techniques.</p>","PeriodicalId":50933,"journal":{"name":"ACM Transactions on Software Engineering and Methodology","volume":"52 1","pages":""},"PeriodicalIF":4.4,"publicationDate":"2023-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138569131","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Junda He, Xin Zhou, Bowen Xu, Ting Zhang, Kisub Kim, Zhou Yang, Ferdian Thung, Ivana Clairine Irsan, David Lo
The tremendous success of Stack Overflow has accumulated an extensive corpus of software engineering knowledge, thus motivating researchers to propose various solutions for analyzing its content. The performance of such solutions hinges significantly on the selection of representation models for Stack Overflow posts. As the volume of literature on Stack Overflow continues to burgeon, it highlights the need for a powerful Stack Overflow post representation model and drives researchers’ interest in developing specialized representation models that can adeptly capture the intricacies of Stack Overflow posts. The state-of-the-art (SOTA) Stack Overflow post representation models are Post2Vec and BERTOverflow, which are built upon neural networks such as convolutional neural network (CNN) and transformer architecture (e.g., BERT). Despite their promising results, these representation methods have not been evaluated in the same experimental setting. To fill the research gap, we first empirically compare the performance of the representation models designed specifically for Stack Overflow posts (Post2Vec and BERTOverflow) in a wide range of related tasks, i.e., tag recommendation, relatedness prediction, and API recommendation. The results show that Post2Vec cannot further improve the state-of-the-art techniques of the considered downstream tasks, and BERTOverflow shows surprisingly poor performance. To find more suitable representation models for the posts, we further explore a diverse set of transformer-based models, including (1) general domain language models (RoBERTa, Longformer, GPT2) and (2) language models built with software engineering-related textual artifacts (CodeBERT, GraphCodeBERT, seBERT, CodeT5, PLBart, and CodeGen). This exploration shows that models like CodeBERT and RoBERTa are suitable for representing Stack Overflow posts. However, it also illustrates the “No Silver Bullet” concept, as none of the models consistently wins against all the others. Inspired by the findings, we propose SOBERT, which employs a simple yet effective strategy to improve the representation models of Stack Overflow posts by continuing the pre-training phase with the textual artifact from Stack Overflow. The overall experimental results demonstrate that SOBERT can consistently outperform the considered models and increase the state-of-the-art performance significantly for all the downstream tasks.
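The continued pre-training idea behind SOBERT can be sketched with the HuggingFace stack: start from an existing checkpoint and keep training the masked-language-model objective on Stack Overflow text. The checkpoint name, hyperparameters, and data loading below are assumptions for illustration, not the paper's setup.

```python
# Continued masked-language-model pre-training on Stack Overflow text (a sketch).
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

posts = ["How do I reverse a list in Python?", "..."]   # Stack Overflow post texts

tok = AutoTokenizer.from_pretrained("microsoft/codebert-base")   # assumed starting checkpoint
model = AutoModelForMaskedLM.from_pretrained("microsoft/codebert-base")

dataset = Dataset.from_dict({"text": posts}).map(
    lambda batch: tok(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sobert-ckpt", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tok,
                                                  mlm_probability=0.15),
)
trainer.train()   # the adapted encoder is then fine-tuned per downstream task
```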
{"title":"Representation Learning for Stack Overflow Posts: How Far are We?","authors":"Junda He, Xin Zhou, Bowen Xu, Ting Zhang, Kisub Kim, Zhou Yang, Ferdian Thung, Ivana Clairine Irsan, David Lo","doi":"10.1145/3635711","DOIUrl":"https://doi.org/10.1145/3635711","url":null,"abstract":"<p>The tremendous success of Stack Overflow has accumulated an extensive corpus of software engineering knowledge, thus motivating researchers to propose various solutions for analyzing its content. The performance of such solutions hinges significantly on the selection of representation models for Stack Overflow posts. As the volume of literature on Stack Overflow continues to burgeon, it highlights the need for a powerful Stack Overflow post representation model and drives researchers’ interest in developing specialized representation models that can adeptly capture the intricacies of Stack Overflow posts. The state-of-the-art (SOTA) Stack Overflow post representation models are Post2Vec and BERTOverflow, which are built upon neural networks such as convolutional neural network (CNN) and transformer architecture (e.g., BERT). Despite their promising results, these representation methods have not been evaluated in the same experimental setting. To fill the research gap, we first empirically compare the performance of the representation models designed specifically for Stack Overflow posts (Post2Vec and BERTOverflow) in a wide range of related tasks, i.e., tag recommendation, relatedness prediction, and API recommendation. The results show that Post2Vec cannot further improve the state-of-the-art techniques of the considered downstream tasks, and BERTOverflow shows surprisingly poor performance. To find more suitable representation models for the posts, we further explore a diverse set of transformer-based models, including (1) general domain language models (RoBERTa, Longformer, GPT2) and (2) language models built with software engineering-related textual artifacts (CodeBERT, GraphCodeBERT, seBERT, CodeT5, PLBart, and CodeGen). This exploration shows that models like CodeBERT and RoBERTa are suitable for representing Stack Overflow posts. However, it also illustrates the “No Silver Bullet” concept, as none of the models consistently wins against all the others. Inspired by the findings, we propose SOBERT, which employs a simple yet effective strategy to improve the representation models of Stack Overflow posts by continuing the pre-training phase with the textual artifact from Stack Overflow. The overall experimental results demonstrate that SOBERT can consistently outperform the considered models and increase the state-of-the-art performance significantly for all the downstream tasks.</p>","PeriodicalId":50933,"journal":{"name":"ACM Transactions on Software Engineering and Methodology","volume":"101 1","pages":""},"PeriodicalIF":4.4,"publicationDate":"2023-12-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138547269","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}