Due to the rapid adoption of Deep Neural Networks (DNNs) into larger software systems, testing of DNN-based systems has received much attention recently. While many different test adequacy criteria have been suggested, we lack effective test input generation techniques. Inputs such as images of real-world objects and scenes are not only expensive to collect but also difficult to sample randomly. Consequently, current testing techniques for DNNs tend to apply small local perturbations to existing inputs to generate new inputs. We propose SINVAD, a way to sample from, and navigate over, a space of realistic inputs that resembles the true distribution in the training data. Our input space is constructed using Variational AutoEncoders (VAEs), and navigated through their latent vector space. Our analysis shows that the VAE-based input space is well-aligned with human perception of what constitutes realistic inputs. Further, we show that this space can be effectively searched to achieve various testing scenarios, such as boundary testing of two different DNNs or analyzing class labels that are difficult for the given DNN to distinguish. Guidelines on how to design VAE architectures are presented as well. Our results have the potential to open the field to meaningful exploration through the space of highly structured images.
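Concretely, the latent-space search the abstract describes could be approximated as in the minimal sketch below, assuming a pre-trained VAE exposing encode()/decode() and a classifier under test; the names vae, classifier, and seed_image are placeholders rather than SINVAD's actual API.

```python
# A minimal sketch of latent-space search in the spirit of SINVAD.
# `vae`, `classifier`, and `seed_image` are assumed placeholder objects.
import torch

def latent_search(vae, classifier, seed_image, steps=500, sigma=0.1):
    """Random hill-climbing in the VAE latent space: look for a decoded image
    that the classifier labels differently from the seed while staying close
    to the seed's latent code (i.e., remaining 'realistic')."""
    with torch.no_grad():
        mu, _ = vae.encode(seed_image.unsqueeze(0))        # seed's latent mean
        seed_label = classifier(seed_image.unsqueeze(0)).argmax(dim=1)
        best_z, best_dist = None, float("inf")
        for _ in range(steps):
            z = mu + sigma * torch.randn_like(mu)          # perturb in latent space
            candidate = vae.decode(z)
            pred = classifier(candidate).argmax(dim=1)
            dist = torch.norm(z - mu).item()
            # keep the misclassifying candidate closest to the seed's code
            if pred.item() != seed_label.item() and dist < best_dist:
                best_z, best_dist = z, dist
        return vae.decode(best_z) if best_z is not None else None
```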
{"title":"Deceiving Humans and Machines Alike: Search-based Test Input Generation for DNNs using Variational Autoencoders","authors":"Sungmin Kang, Robert Feldt, Shin Yoo","doi":"10.1145/3635706","DOIUrl":"https://doi.org/10.1145/3635706","url":null,"abstract":"<p>Due to the rapid adoption of Deep Neural Networks (DNNs) into larger software systems, testing of DNN based systems has received much attention recently. While many different test adequacy criteria have been suggested, we lack effective test input generation techniques. Inputs such as images of real world objects and scenes are not only expensive to collect but also difficult to randomly sample. Consequently, current testing techniques for DNNs tend to apply small local perturbations to existing inputs to generate new inputs. We propose SINVAD, a way to sample from, and navigate over, a space of realistic inputs that resembles the true distribution in the training data. Our input space is constructed using Variational AutoEncoders (VAEs), and navigated through their latent vector space. Our analysis shows that the VAE-based input space is well-aligned with human perception of what constitutes realistic inputs. Further, we show that this space can be effectively searched to achieve various testing scenarios, such as boundary testing of two different DNNs or analyzing class labels that are difficult for the given DNN to distinguish. Guidelines on how to design VAE architectures are presented as well. Our results have the potential to open the field to meaningful exploration through the space of highly structured images.</p>","PeriodicalId":50933,"journal":{"name":"ACM Transactions on Software Engineering and Methodology","volume":"1 1","pages":""},"PeriodicalIF":4.4,"publicationDate":"2023-12-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138823838","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Baharin A. Jodat, Abhishek Chandar, Shiva Nejati, Mehrdad Sabetzadeh
Test inputs fail not only when the system under test is faulty but also when the inputs are invalid or unrealistic. Failures resulting from invalid or unrealistic test inputs are spurious. Avoiding spurious failures improves the effectiveness of testing in exercising the main functions of a system, particularly for compute-intensive (CI) systems where a single test execution takes significant time. In this paper, we propose to build failure models for inferring interpretable rules on test inputs that cause spurious failures. We examine two alternative strategies for building failure models: (1) machine learning (ML)-guided test generation and (2) surrogate-assisted test generation. ML-guided test generation infers boundary regions that separate passing and failing test inputs and samples test inputs from those regions. Surrogate-assisted test generation relies on surrogate models to predict labels for test inputs instead of exercising all the inputs. We propose a novel surrogate-assisted algorithm that uses multiple surrogate models simultaneously, and dynamically selects the prediction from the most accurate model. We empirically evaluate the accuracy of failure models inferred based on surrogate-assisted and ML-guided test generation algorithms. Using case studies from the domains of cyber-physical systems and networks, we show that our proposed surrogate-assisted approach generates failure models with an average accuracy of 83%, significantly outperforming ML-guided test generation and two baselines. Further, our approach learns failure-inducing rules that identify genuine spurious failures as validated against domain knowledge.
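The surrogate-assisted strategy can be illustrated with a small sketch: several surrogate models are trained on the test inputs labeled so far, the most accurate one (by cross-validation) labels new candidates, and the real, compute-intensive test is only run when the surrogate is not confident. The function run_real_test, the model choices, and the 0.9 confidence threshold are assumptions for illustration, not the paper's algorithm.

```python
# A minimal sketch of the multi-surrogate idea, under the assumptions above.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def label_with_surrogates(X_labeled, y_labeled, X_candidates, run_real_test):
    surrogates = [RandomForestClassifier(), KNeighborsClassifier(),
                  LogisticRegression(max_iter=1000)]
    # Rank surrogates by cross-validated accuracy on the data labeled so far.
    scores = [cross_val_score(m, X_labeled, y_labeled, cv=3).mean()
              for m in surrogates]
    best = surrogates[int(np.argmax(scores))]
    best.fit(X_labeled, y_labeled)

    labels = []
    for x in X_candidates:
        proba = best.predict_proba([x])[0]
        if proba.max() >= 0.9:                         # confident: trust the surrogate
            labels.append(best.classes_[int(np.argmax(proba))])
        else:                                          # uncertain: run the real test
            labels.append(run_real_test(x))
    return labels
```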
{"title":"Test Generation Strategies for Building Failure Models and Explaining Spurious Failures","authors":"Baharin A. Jodat, Abhishek Chandar, Shiva Nejati, Mehrdad Sabetzadeh","doi":"10.1145/3638246","DOIUrl":"https://doi.org/10.1145/3638246","url":null,"abstract":"<p>Test inputs fail not only when the system under test is faulty but also when the inputs are invalid or unrealistic. Failures resulting from invalid or unrealistic test inputs are spurious. Avoiding spurious failures improves the effectiveness of testing in exercising the main functions of a system, particularly for compute-intensive (CI) systems where a single test execution takes significant time. In this paper, we propose to build failure models for inferring interpretable rules on test inputs that cause spurious failures. We examine two alternative strategies for building failure models: (1) machine learning (ML)-guided test generation and (2) surrogate-assisted test generation. <i>ML-guided test generation</i> infers boundary regions that separate passing and failing test inputs and samples test inputs from those regions. <i>Surrogate-assisted test generation</i> relies on surrogate models to predict labels for test inputs instead of exercising all the inputs. We propose a novel surrogate-assisted algorithm that uses multiple surrogate models simultaneously, and dynamically selects the prediction from the most accurate model. We empirically evaluate the accuracy of failure models inferred based on surrogate-assisted and ML-guided test generation algorithms. Using case studies from the domains of cyber-physical systems and networks, we show that our proposed surrogate-assisted approach generates failure models with an average accuracy of 83%, significantly outperforming ML-guided test generation and two baselines. Further, our approach learns failure-inducing rules that identify genuine spurious failures as validated against domain knowledge.</p>","PeriodicalId":50933,"journal":{"name":"ACM Transactions on Software Engineering and Methodology","volume":"81 1","pages":""},"PeriodicalIF":4.4,"publicationDate":"2023-12-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139028039","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Han Wang, Sijia Yu, Chunyang Chen, Burak Turhan, Xiaodong Zhu
Deep Learning (DL) models have rapidly advanced, focusing on achieving high performance through testing model accuracy and robustness. However, it is unclear whether DL projects, as software systems, are thoroughly tested or functionally correct when they need to be treated and tested like other software systems. Therefore, we empirically study the unit tests in open-source DL projects, analyzing 9,129 projects from GitHub. We find that: 1) unit-tested DL projects show a positive correlation with open-source project metrics and have a higher acceptance rate of pull requests, 2) 68% of the sampled DL projects are not unit tested at all, 3) the layer and utility (utils) modules of DL models have the most unit tests. Based on these findings and previous research outcomes, we built a mapping taxonomy between unit tests and faults in DL projects. We discuss the implications of our findings for developers and researchers and highlight the need for unit testing in open-source DL projects to ensure their reliability and stability. The study contributes to this community by raising awareness of the importance of unit testing in DL projects and encouraging further research in this area.
{"title":"Beyond Accuracy: An Empirical Study on Unit Testing in Open-source Deep Learning Projects","authors":"Han Wang, Sijia Yu, Chunyang Chen, Burak Turhan, Xiaodong Zhu","doi":"10.1145/3638245","DOIUrl":"https://doi.org/10.1145/3638245","url":null,"abstract":"<p>Deep Learning (DL) models have rapidly advanced, focusing on achieving high performance through testing model accuracy and robustness. However, it is unclear whether DL projects, as software systems, are tested thoroughly or functionally correct when there is a need to treat and test them like other software systems. Therefore, we empirically study the unit tests in open-source DL projects, analyzing 9,129 projects from GitHub. We find that: 1) unit tested DL projects have positive correlation with the open-source project metrics and have a higher acceptance rate of pull requests, 2) 68% of the sampled DL projects are not unit tested at all, 3) the layer and utilities (utils) of DL models have the most unit tests. Based on these findings and previous research outcomes, we built a mapping taxonomy between unit tests and faults in DL projects. We discuss the implications of our findings for developers and researchers and highlight the need for unit testing in open-source DL projects to ensure their reliability and stability. The study contributes to this community by raising awareness of the importance of unit testing in DL projects and encouraging further research in this area.</p>","PeriodicalId":50933,"journal":{"name":"ACM Transactions on Software Engineering and Methodology","volume":"1 1","pages":""},"PeriodicalIF":4.4,"publicationDate":"2023-12-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138819666","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Software development teams establish elaborate continuous integration pipelines containing automated test cases to accelerate the development process of software. Automated tests help to verify the correctness of code modifications, decreasing the response time to changing requirements. However, when software teams do not track the performance impact of pending modifications, they may need to spend considerable time refactoring existing code. This paper presents PACE, a program analysis framework that provides continuous feedback on the performance impact of pending code updates. We design performance microbenchmarks by mapping the execution time of functional test cases given a code update. We map microbenchmarks to code stylometry features and feed them to predictors for performance predictions. In our experiments, PACE predicts code performance accurately, outperforming the current state of the art by 75% on neural-represented code stylometry features.
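The prediction step described above can be sketched as a simple regression from code-stylometry feature vectors to measured test execution times. Feature extraction is out of scope here, and the model choice is an assumption rather than PACE's actual design.

```python
# A rough sketch of the prediction step, under the assumptions above.
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_percentage_error

def train_performance_predictor(stylometry_features, test_exec_times):
    """stylometry_features: one feature vector per code update;
    test_exec_times: measured execution time of the functional tests."""
    X_train, X_test, y_train, y_test = train_test_split(
        stylometry_features, test_exec_times, test_size=0.2, random_state=0)
    model = GradientBoostingRegressor().fit(X_train, y_train)
    mape = mean_absolute_percentage_error(y_test, model.predict(X_test))
    return model, mape   # model predicts execution time for a pending update
```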
{"title":"PACE: A Program Analysis Framework for Continuous Performance Prediction","authors":"Chidera Biringa, Gökhan Kul","doi":"10.1145/3637230","DOIUrl":"https://doi.org/10.1145/3637230","url":null,"abstract":"<p>Software development teams establish elaborate continuous integration pipelines containing automated test cases to accelerate the development process of software. Automated tests help to verify the correctness of code modifications decreasing the response time to changing requirements. However, when the software teams do not track the performance impact of pending modifications, they may need to spend considerable time refactoring existing code. This paper presents <monospace>PACE</monospace>, a program analysis framework that provides continuous feedback on the performance impact of pending code updates. We design performance microbenchmarks by mapping the execution time of functional test cases given a code update. We map microbenchmarks to code stylometry features and feed them to predictors for performance predictions. Our experiments achieved significant performance in predicting code performance, outperforming current state-of-the-art by 75% on neural-represented code stylometry features.</p>","PeriodicalId":50933,"journal":{"name":"ACM Transactions on Software Engineering and Methodology","volume":"2 1","pages":""},"PeriodicalIF":4.4,"publicationDate":"2023-12-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138691862","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pablo C. Cañizares, Jose María López-Morales, Sara Pérez-Soler, Esther Guerra, Juan de Lara
Conversational agents, or chatbots, have become a popular way to access all kinds of software services. They provide an intuitive natural language interface for interaction, available from a wide range of channels including social networks, web pages, intelligent speakers or cars. In response to this demand, many chatbot development platforms and tools have emerged. However, they typically lack support to statically measure properties of the chatbots being built, as indicators of their size, complexity, quality or usability. Similarly, there are hardly any mechanisms to compare and cluster chatbots developed with heterogeneous technologies.
To overcome this limitation, we propose a suite of 21 metrics for chatbot designs, as well as two clustering methods that help in grouping chatbots along their conversation topics and design features. Both the metrics and the clustering methods are defined on a neutral chatbot design language, becoming independent of the implementation platform. We provide automatic translations of chatbots defined on some major platforms into this neutral notation to perform the measurement and clustering. The approach is supported by our tool Asymob, which we have used to evaluate the metrics and the clustering methods over a set of 259 Dialogflow and Rasa chatbots from open-source repositories. The results open the door to incorporating the metrics within chatbot development processes for the early detection of quality issues, and to exploiting clustering to organise large collections of chatbots into meaningful groups that ease chatbot comprehension, search and comparison.
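As a small illustration of the clustering side of this approach, chatbots can be represented as vectors of design metrics, standardised, and grouped with an off-the-shelf clustering algorithm. The metric values below are invented examples; the paper's 21 metrics and Asymob's actual pipeline are not reproduced.

```python
# Clustering chatbots by (hypothetical) design metrics.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# rows: chatbots; columns: e.g. #intents, #entities, #training phrases, #flows
metric_vectors = np.array([
    [12,  4,  35,  3],
    [48, 17, 210,  9],
    [10,  3,  28,  2],
    [55, 20, 250, 11],
])

X = StandardScaler().fit_transform(metric_vectors)       # put metrics on one scale
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # chatbots with similar design features share a cluster id
```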
{"title":"Measuring and Clustering Heterogeneous Chatbot Designs","authors":"Pablo C. Cañizares, Jose María López-Morales, Sara Pérez-Soler, Esther Guerra, Juan de Lara","doi":"10.1145/3637228","DOIUrl":"https://doi.org/10.1145/3637228","url":null,"abstract":"<p>Conversational agents, or chatbots, have become popular to access all kind of software services. They provide an intuitive natural language interface for interaction, available from a wide range of channels including social networks, web pages, intelligent speakers or cars. In response to this demand, many chatbot development platforms and tools have emerged. However, they typically lack support to statically measure properties of the chatbots being built, as indicators of their size, complexity, quality or usability. Similarly, there are hardly any mechanisms to compare and cluster chatbots developed with heterogeneous technologies. </p><p>To overcome this limitation, we propose a suite of 21 metrics for chatbot designs, as well as two clustering methods that help in grouping chatbots along their conversation topics and design features. Both the metrics and the clustering methods are defined on a neutral chatbot design language, becoming independent of the implementation platform. We provide automatic translations of chatbots defined on some major platforms into this neutral notation to perform the measurement and clustering. The approach is supported by our tool <span>Asymob</span>, which we have used to evaluate the metrics and the clustering methods over a set of 259 Dialogflow and Rasa chatbots from open-source repositories. The results open the door to incorporating the metrics within chatbot development processes for the early detection of quality issues, and to exploit clustering to organise large collections of chatbots into significant groups to ease chatbot comprehension, search and comparison.</p>","PeriodicalId":50933,"journal":{"name":"ACM Transactions on Software Engineering and Methodology","volume":"44 1","pages":""},"PeriodicalIF":4.4,"publicationDate":"2023-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138628466","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A smart contract is a kind of code deployed on the blockchain that executes automatically once an event triggers a clause in the contract. Since smart contracts involve businesses such as asset transfer, they are more vulnerable to attacks, so it is crucial to ensure the security of smart contracts. Because a smart contract cannot be tampered with once deployed on the blockchain, it is necessary for smart contract developers to fix vulnerabilities before deployment. Compared with the many vulnerability detection tools for smart contracts, the number of automatic repair approaches for smart contracts is relatively limited. These approaches mainly use defined pattern-based methods or heuristic search algorithms for vulnerability repairs. In this paper, we propose RLRep, a reinforcement learning-based approach to provide smart contract repair recommendations for smart contract developers automatically. This approach adopts an agent to provide repair action suggestions based on the vulnerable smart contract without any supervision, which can solve the problem of missing labeled data in machine learning-based repair methods. We evaluate our approach on a dataset containing 853 smart contract programs (programming language: Solidity) with different kinds of vulnerabilities. We split them into a training and a test set. The result shows that our approach can provide 54.97% correct repair recommendations for smart contracts.
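A highly simplified, tabular Q-learning sketch conveys the idea of an agent that learns which repair action to suggest for which vulnerability, rewarded when a detector no longer flags the patched contract. The environment functions (apply_action, detector) are placeholders; RLRep's actual state and action encoding is not shown.

```python
# A loose, tabular Q-learning illustration of learning repair suggestions.
import random
from collections import defaultdict

def train_repair_agent(contracts, actions, apply_action, detector,
                       episodes=1000, alpha=0.1, gamma=0.9, eps=0.2):
    Q = defaultdict(float)                           # Q[(state, action)] -> value
    for _ in range(episodes):
        contract = random.choice(contracts)
        state = detector(contract)                   # e.g. detected vulnerability type
        if random.random() < eps:
            action = random.choice(actions)          # explore
        else:
            action = max(actions, key=lambda a: Q[(state, a)])  # exploit
        patched = apply_action(contract, action)
        new_state = detector(patched)                # None = no vulnerability left
        reward = 1.0 if new_state is None else -0.1
        future = 0.0 if new_state is None else max(Q[(new_state, a)] for a in actions)
        Q[(state, action)] += alpha * (reward + gamma * future - Q[(state, action)])
    return Q   # at suggestion time: recommend argmax_a Q[(vulnerability, a)]
```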
{"title":"Smart Contract Code Repair Recommendation based on Reinforcement Learning and Multi-metric Optimization","authors":"Hanyang Guo, Yingye Chen, Xiangping Chen, Yuan Huang, Zibin Zheng","doi":"10.1145/3637229","DOIUrl":"https://doi.org/10.1145/3637229","url":null,"abstract":"<p>A smart contract is a kind of code deployed on the blockchain that executes automatically once an event triggers a clause in the contract. Since smart contracts involve businesses such as asset transfer, they are more vulnerable to attacks, so it is crucial to ensure the security of smart contracts. Because a smart contract cannot be tampered with once deployed on the blockchain, for smart contract developers, it is necessary to fix vulnerabilities before deployment. Compared with many vulnerability detection tools for smart contracts, the amount of automatic fix approaches for smart contracts is relatively limited. These approaches mainly use defined pattern-based methods or heuristic search algorithms for vulnerability repairs. In this paper, we propose <i>RLRep</i>, a reinforcement learning-based approach to provide smart contract repair recommendations for smart contract developers automatically. This approach adopts an agent to provide repair action suggestions based on the vulnerable smart contract without any supervision, which can solve the problem of missing labeled data in machine learning-based repair methods. We evaluate our approach on a dataset containing 853 smart contract programs (programming language: Solidity) with different kinds of vulnerabilities. We split them into training and test set. The result shows that our approach can provide 54.97% correct repair recommendations for smart contracts.</p>","PeriodicalId":50933,"journal":{"name":"ACM Transactions on Software Engineering and Methodology","volume":"15 1","pages":""},"PeriodicalIF":4.4,"publicationDate":"2023-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138574724","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shikai Guo, Dongmin Li, Lin Huang, Sijia Lv, Rong Chen, Hui Li, Xiaochen Li, He Jiang
The aim of Just-In-Time (JIT) defect prediction is to predict software changes that are prone to defects in a project in a timely manner, thereby improving the efficiency of software development and ensuring software quality. Identifying changes that introduce bugs is a critical task in just-in-time defect prediction, and researchers have introduced the SZZ approach and its variants to label these changes. However, it has been shown that different SZZ algorithms introduce noise to the dataset to a certain extent, which may reduce the predictive performance of the model. To address this limitation, we propose the Confident Learning Imbalance (CLI) model. The model identifies and excludes samples whose labels may be corrupted by estimating the joint distribution of noisy labels and true labels, and mitigates the impact of noisy data on the performance of the prediction model. The CLI consists of two components: a Confident Learning (CL) component that identifies noisy data and an Imbalanced Data Probabilistic Prediction (IDPP) component that generates a predicted probability matrix for imbalanced data. The IDPP component generates precise predicted probabilities for each instance in the training set, while the CL component uses the generated predicted probability matrix and noise labels to clean up the noise and build a classification model. We evaluate the performance of our model through extensive experiments on a total of 126,526 changes from ten Apache open source projects, and the results show that our model outperforms the baseline methods.
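The confident-learning step can be sketched as follows: given out-of-sample predicted probabilities and the (possibly noisy) SZZ labels, flag the changes whose labels are likely corrupted. This follows the generic confident-learning recipe and is not necessarily the exact CLI implementation.

```python
# A compact sketch of flagging likely-mislabeled changes via confident learning.
import numpy as np

def find_suspect_labels(pred_probs, noisy_labels):
    """pred_probs: (n_samples, n_classes) out-of-sample predicted probabilities;
    noisy_labels: (n_samples,) integer labels produced by an SZZ variant."""
    n_classes = pred_probs.shape[1]
    # per-class threshold: average self-confidence of samples given that label
    thresholds = np.array([pred_probs[noisy_labels == c, c].mean()
                           for c in range(n_classes)])
    suspects = []
    for i, (p, y) in enumerate(zip(pred_probs, noisy_labels)):
        # classes the model confidently asserts for this sample
        confident = [c for c in range(n_classes) if p[c] >= thresholds[c]]
        if confident and y not in confident:
            suspects.append(i)       # label disagrees with a confident prediction
    return suspects                  # candidates to prune before training
```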
{"title":"Estimating Uncertainty in Labeled Changes by SZZ Tools on Just-In-Time Defect Prediction","authors":"Shikai Guo, Dongmin Li, Lin Huang, Sijia Lv, Rong Chen, Hui Li, Xiaochen Li, He Jiang","doi":"10.1145/3637226","DOIUrl":"https://doi.org/10.1145/3637226","url":null,"abstract":"<p>The aim of Just-In-Time (JIT) defect prediction is to predict software changes that are prone to defects in a project in a timely manner, thereby improving the efficiency of software development and ensuring software quality. Identifying changes that introduce bugs is a critical task in just-in-time defect prediction, and researchers have introduced the SZZ approach and its variants to label these changes. However, it has been shown that different SZZ algorithms introduce noise to the dataset to a certain extent, which may reduce the predictive performance of the model. To address this limitation, we propose the Confident Learning Imbalance (CLI) model. The model identifies and excludes samples whose labels may be corrupted by estimating the joint distribution of noisy labels and true labels, and mitigates the impact of noisy data on the performance of the prediction model. The CLI consists of two components: identifying noisy data (Confident Learning Component) and generating a predicted probability matrix for imbalanced data (Imbalanced Data Probabilistic Prediction Component). The IDPP component generates precise predicted probabilities for each instance in the training set, while the CL component uses the generated predicted probability matrix and noise labels to clean up the noise and build a classification model. We evaluate the performance of our model through extensive experiments on a total of 126,526 changes from ten Apache open source projects, and the results show that our model outperforms the baseline methods.</p>","PeriodicalId":50933,"journal":{"name":"ACM Transactions on Software Engineering and Methodology","volume":"25 1","pages":""},"PeriodicalIF":4.4,"publicationDate":"2023-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138577117","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
C is a dominant programming language for implementing system and low-level embedded software. Unfortunately, the unsafe nature of its low-level control of memory often leads to memory errors. Dynamic analysis has been widely used to detect memory errors at runtime. However, existing monitoring algorithms for dynamic analysis are not yet satisfactory as they cannot deterministically and completely detect some types of errors, e.g., segment confusion errors, sub-object overflows, use-after-frees and memory leaks.
We propose a new monitoring algorithm, namely Smatus, short for smart status, that improves memory safety by performing comprehensive dynamic analysis. The key innovation is to maintain at runtime a small status node for each memory object. A status node records the status value and reference count of an object, where the status value denotes the liveness and segment type of this object, and the reference count tracks the number of pointer variables pointing to this object. Smatus maintains at runtime a pointer metadata for each pointer variable, to record not only the base and bound of a pointer’s referent but also the address of the referent’s status node. All the pointers pointing to the same referent share the same status node in their pointer metadata. A status node is smart in the sense that it is automatically deleted when it becomes useless (indicated by its reference count reaching zero). To the best of our knowledge, Smatus represents the most comprehensive approach of its kind.
We have evaluated Smatus using a large set of programs including the NIST Software Assurance Reference Dataset, MSBench, MiBench, SPEC and stress-testing benchmarks. In terms of effectiveness (detecting different types of memory errors), Smatus outperforms the state-of-the-art tools Google's AddressSanitizer, SoftBoundCETS and Valgrind, as it detects more errors. In terms of performance, Smatus incurs lower time and memory overheads than SoftBoundCETS and Valgrind, and is on par with AddressSanitizer in the time and memory overhead tradeoff made, while incurring much lower memory overheads.
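For intuition, the bookkeeping described above can be illustrated in a few lines of Python (the actual tool instruments C programs): each object gets a status node holding a status value and a reference count, and each pointer carries base, bound, and a link to its referent's status node.

```python
# A conceptual, language-shifted illustration of status-node bookkeeping;
# not the tool's implementation, which operates on instrumented C code.
class StatusNode:
    def __init__(self, status):
        self.status = status        # liveness + segment type, e.g. "heap-live"
        self.refcount = 0

class PointerMeta:
    def __init__(self, base, bound, node):
        self.base, self.bound = base, bound
        self.node = node
        node.refcount += 1          # one more pointer refers to this object

    def release(self):
        self.node.refcount -= 1
        if self.node.refcount == 0: # no pointer left: the node is useless
            self.node = None        # "smart" deletion of the status node

def check_deref(ptr_meta, addr):
    if ptr_meta.node is None or ptr_meta.node.status == "freed":
        raise RuntimeError("use-after-free detected")
    if not (ptr_meta.base <= addr < ptr_meta.bound):
        raise RuntimeError("out-of-bounds access detected")
```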
{"title":"A Smart Status Based Monitoring Algorithm for the Dynamic Analysis of Memory Safety","authors":"Zhe Chen, Rui Yan, Yingzi Ma, Yulei Sui, Jingling Xue","doi":"10.1145/3637227","DOIUrl":"https://doi.org/10.1145/3637227","url":null,"abstract":"<p>C is a dominant programming language for implementing system and low-level embedded software. Unfortunately, the unsafe nature of its low-level control of memory often leads to memory errors. Dynamic analysis has been widely used to detect memory errors at runtime. However, existing monitoring algorithms for dynamic analysis are not yet satisfactory as they cannot deterministically and completely detect some types of errors, e.g., segment confusion errors, sub-object overflows, use-after-frees and memory leaks. </p><p>We propose a new monitoring algorithm, namely <span>Smatus</span>, short for <i>smart status</i>, that improves memory safety by performing comprehensive dynamic analysis. The key innovation is to maintain at runtime a small <i>status node</i> for each memory object. A status node records the <i>status value</i> and <i>reference count</i> of an object, where the status value denotes the liveness and segment type of this object, and the reference count tracks the number of pointer variables pointing to this object. <span>Smatus</span> maintains at runtime a pointer metadata for each pointer variable, to record not only the base and bound of a pointer’s referent but also the address of the referent’s status node. All the pointers pointing to the same referent share the same status node in their pointer metadata. A status node is <i>smart</i> in the sense that it is automatically deleted when it becomes useless (indicated by its reference count reaching zero). To the best of our knowledge, <span>Smatus</span> represents the most comprehensive approach of its kind. </p><p>We have evaluated <span>Smatus</span> by using a large set of programs including the NIST Software Assurance Reference Dataset, MSBench, MiBench, SPEC and stress testing benchmarks. In terms of effectiveness (detecting different types of memory errors), <span>Smatus</span> outperforms state-of-the-art tools, Google’s AddressSanitizer, SoftBoundCETS and Valgrind, as it is capable of detecting more errors. In terms of performance (the time and memory overheads), <span>Smatus</span> outperforms SoftBoundCETS and Valgrind in terms of both lower time and memory overheads incurred, and is on par with AddressSanitizer in terms of the time and memory overhead tradeoff made (with much lower memory overheads incurred).</p>","PeriodicalId":50933,"journal":{"name":"ACM Transactions on Software Engineering and Methodology","volume":"13 1","pages":""},"PeriodicalIF":4.4,"publicationDate":"2023-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138569386","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The field of software verification has produced a wide array of algorithmic techniques that can prove a variety of properties of a given program. It has been demonstrated that the performance of these techniques can vary up to 4 orders of magnitude on the same verification problem. Even for verification experts, it is difficult to decide which tool will perform best on a given problem. For general users, deciding the best tool for their verification problem is effectively impossible.
In this work, we present Graves, a selection strategy based on graph neural networks (GNNs). Graves generates a graph representation of a program from which a GNN predicts a score for a verifier that indicates its performance on the program.
We evaluate Graves on a set of 10 verification tools and over 8000 verification problems and find that it improves the state-of-the-art in verification algorithm selection by 12%, or 8 percentage points. Further, it is able to verify 9% more problems than any existing verifier on our test set. Through a qualitative study on model interpretability, we find strong evidence that the Graves model learns to base its predictions on factors that relate to the unique features of the algorithmic techniques.
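A bare-bones sketch of such a scoring model: a small graph neural network reads a program graph (node features plus adjacency) and emits one score per candidate verifier. The layer sizes and the plain-PyTorch message passing are assumptions for illustration, not Graves' architecture.

```python
# A toy GNN that scores verifiers from a program graph, under the assumptions above.
import torch
import torch.nn as nn

class VerifierScorer(nn.Module):
    def __init__(self, feat_dim, hidden=64, n_verifiers=10):
        super().__init__()
        self.w1 = nn.Linear(feat_dim, hidden)
        self.w2 = nn.Linear(hidden, hidden)
        self.head = nn.Linear(hidden, n_verifiers)

    def forward(self, x, adj):
        # x: (n, feat_dim) node features; adj: (n, n) adjacency with self-loops
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        norm_adj = adj / deg                        # mean-aggregation of neighbors
        h = torch.relu(self.w1(norm_adj @ x))       # message passing, round 1
        h = torch.relu(self.w2(norm_adj @ h))       # message passing, round 2
        graph_repr = h.mean(dim=0)                  # pool nodes into a program embedding
        return self.head(graph_repr)                # one predicted score per verifier

# usage: pick the verifier with the highest predicted score
# scores = VerifierScorer(feat_dim=32)(node_feats, adjacency)
# best_verifier = scores.argmax().item()
```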
{"title":"Algorithm Selection for Software Verification using Graph Neural Networks","authors":"Will Leeson, Matthew B. Dwyer","doi":"10.1145/3637225","DOIUrl":"https://doi.org/10.1145/3637225","url":null,"abstract":"<p>The field of software verification has produced a wide array of algorithmic techniques that can prove a variety of properties of a given program. It has been demonstrated that the performance of these techniques can vary up to 4 orders of magnitude on the same verification problem. Even for verification experts, it is difficult to decide which tool will perform best on a given problem. For general users, deciding the best tool for their verification problem is effectively impossible. </p><p>In this work, we present <span>Graves</span>, a selection strategy based on graph neural networks (GNNs). <span>Graves</span> generates a graph representation of a program from which a GNN predicts a score for a verifier that indicates its performance on the program. </p><p>We evaluate <span>Graves</span> on a set of 10 verification tools and over 8000 verification problems and find that it improves the state-of-the-art in verification algorithm selection by 12%, or 8 percentage points. Further, it is able to verify 9% more problems than any existing verifier on our test set. Through a qualitative study on model interpretability, we find strong evidence that the <span>Graves</span>’ model learns to base its predictions on factors that relate to the unique features of the algorithmic techniques.</p>","PeriodicalId":50933,"journal":{"name":"ACM Transactions on Software Engineering and Methodology","volume":"52 1","pages":""},"PeriodicalIF":4.4,"publicationDate":"2023-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138569131","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Junda He, Xin Zhou, Bowen Xu, Ting Zhang, Kisub Kim, Zhou Yang, Ferdian Thung, Ivana Clairine Irsan, David Lo
The tremendous success of Stack Overflow has accumulated an extensive corpus of software engineering knowledge, thus motivating researchers to propose various solutions for analyzing its content. The performance of such solutions hinges significantly on the selection of representation models for Stack Overflow posts. As the volume of literature on Stack Overflow continues to burgeon, it highlights the need for a powerful Stack Overflow post representation model and drives researchers’ interest in developing specialized representation models that can adeptly capture the intricacies of Stack Overflow posts. The state-of-the-art (SOTA) Stack Overflow post representation models are Post2Vec and BERTOverflow, which are built upon neural networks such as convolutional neural network (CNN) and transformer architecture (e.g., BERT). Despite their promising results, these representation methods have not been evaluated in the same experimental setting. To fill the research gap, we first empirically compare the performance of the representation models designed specifically for Stack Overflow posts (Post2Vec and BERTOverflow) in a wide range of related tasks, i.e., tag recommendation, relatedness prediction, and API recommendation. The results show that Post2Vec cannot further improve the state-of-the-art techniques of the considered downstream tasks, and BERTOverflow shows surprisingly poor performance. To find more suitable representation models for the posts, we further explore a diverse set of transformer-based models, including (1) general domain language models (RoBERTa, Longformer, GPT2) and (2) language models built with software engineering-related textual artifacts (CodeBERT, GraphCodeBERT, seBERT, CodeT5, PLBart, and CodeGen). This exploration shows that models like CodeBERT and RoBERTa are suitable for representing Stack Overflow posts. However, it also illustrates the “No Silver Bullet” concept, as none of the models consistently wins against all the others. Inspired by the findings, we propose SOBERT, which employs a simple yet effective strategy to improve the representation models of Stack Overflow posts by continuing the pre-training phase with the textual artifact from Stack Overflow. The overall experimental results demonstrate that SOBERT can consistently outperform the considered models and increase the state-of-the-art performance significantly for all the downstream tasks.
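The continued pre-training idea behind SOBERT can be sketched with the HuggingFace stack: start from an existing checkpoint and keep training the masked-language-model objective on Stack Overflow text. The checkpoint name, hyperparameters, and data loading below are assumptions for illustration, not the paper's setup.

```python
# Continued masked-language-model pre-training on Stack Overflow text (a sketch).
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

posts = ["How do I reverse a list in Python?", "..."]   # Stack Overflow post texts

tok = AutoTokenizer.from_pretrained("microsoft/codebert-base")   # assumed starting checkpoint
model = AutoModelForMaskedLM.from_pretrained("microsoft/codebert-base")

dataset = Dataset.from_dict({"text": posts}).map(
    lambda batch: tok(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sobert-ckpt", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tok,
                                                  mlm_probability=0.15),
)
trainer.train()   # the adapted encoder is then fine-tuned per downstream task
```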
{"title":"Representation Learning for Stack Overflow Posts: How Far are We?","authors":"Junda He, Xin Zhou, Bowen Xu, Ting Zhang, Kisub Kim, Zhou Yang, Ferdian Thung, Ivana Clairine Irsan, David Lo","doi":"10.1145/3635711","DOIUrl":"https://doi.org/10.1145/3635711","url":null,"abstract":"<p>The tremendous success of Stack Overflow has accumulated an extensive corpus of software engineering knowledge, thus motivating researchers to propose various solutions for analyzing its content. The performance of such solutions hinges significantly on the selection of representation models for Stack Overflow posts. As the volume of literature on Stack Overflow continues to burgeon, it highlights the need for a powerful Stack Overflow post representation model and drives researchers’ interest in developing specialized representation models that can adeptly capture the intricacies of Stack Overflow posts. The state-of-the-art (SOTA) Stack Overflow post representation models are Post2Vec and BERTOverflow, which are built upon neural networks such as convolutional neural network (CNN) and transformer architecture (e.g., BERT). Despite their promising results, these representation methods have not been evaluated in the same experimental setting. To fill the research gap, we first empirically compare the performance of the representation models designed specifically for Stack Overflow posts (Post2Vec and BERTOverflow) in a wide range of related tasks, i.e., tag recommendation, relatedness prediction, and API recommendation. The results show that Post2Vec cannot further improve the state-of-the-art techniques of the considered downstream tasks, and BERTOverflow shows surprisingly poor performance. To find more suitable representation models for the posts, we further explore a diverse set of transformer-based models, including (1) general domain language models (RoBERTa, Longformer, GPT2) and (2) language models built with software engineering-related textual artifacts (CodeBERT, GraphCodeBERT, seBERT, CodeT5, PLBart, and CodeGen). This exploration shows that models like CodeBERT and RoBERTa are suitable for representing Stack Overflow posts. However, it also illustrates the “No Silver Bullet” concept, as none of the models consistently wins against all the others. Inspired by the findings, we propose SOBERT, which employs a simple yet effective strategy to improve the representation models of Stack Overflow posts by continuing the pre-training phase with the textual artifact from Stack Overflow. The overall experimental results demonstrate that SOBERT can consistently outperform the considered models and increase the state-of-the-art performance significantly for all the downstream tasks.</p>","PeriodicalId":50933,"journal":{"name":"ACM Transactions on Software Engineering and Methodology","volume":"101 1","pages":""},"PeriodicalIF":4.4,"publicationDate":"2023-12-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138547269","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}