Defect prediction aims to automatically identify potentially defective code with minimal human intervention and has been widely studied in the literature. Just-in-Time (JIT) defect prediction focuses on program changes rather than whole programs, and has been widely adopted in continuous testing. CC2Vec, a state-of-the-art JIT defect prediction tool, first constructs a hierarchical attention network (HAN) to learn distributed vector representations of both code additions and deletions, and then concatenates them with two other embedding vectors representing commit messages and overall code changes extracted by the existing DeepJIT approach to train a model for predicting whether a given commit is defective. Although CC2Vec has been shown to be the state of the art for JIT defect prediction, it was only evaluated on a limited dataset and not compared with all representative baselines. Therefore, to further investigate the efficacy and limitations of CC2Vec, this paper performs an extensive study of CC2Vec on a large-scale dataset with over 310,370 changes (8.3X larger than the original CC2Vec dataset). More specifically, we empirically compare CC2Vec against DeepJIT and representative traditional JIT defect prediction techniques. The experimental results show that CC2Vec cannot consistently outperform DeepJIT, and that neither of them can consistently outperform traditional JIT defect prediction techniques. We also investigate the impact of individual traditional defect prediction features and find that the added-line-number feature outperforms the other traditional features. Inspired by this finding, we construct a simplistic JIT defect prediction approach that simply uses the added-line-number feature with a logistic regression classifier. Surprisingly, such a simplistic approach can outperform CC2Vec and DeepJIT in defect prediction, and can be 81kX/120kX faster in training/testing. Furthermore, the paper provides various practical guidelines for advancing JIT defect prediction in the near future.
{"title":"Deep just-in-time defect prediction: how far are we?","authors":"Zhen Zeng, Yuqun Zhang, Haotian Zhang, Lingming Zhang","doi":"10.1145/3460319.3464819","DOIUrl":"https://doi.org/10.1145/3460319.3464819","url":null,"abstract":"Defect prediction aims to automatically identify potential defective code with minimal human intervention and has been widely studied in the literature. Just-in-Time (JIT) defect prediction focuses on program changes rather than whole programs, and has been widely adopted in continuous testing. CC2Vec, state-of-the-art JIT defect prediction tool, first constructs a hierarchical attention network (HAN) to learn distributed vector representations of both code additions and deletions, and then concatenates them with two other embedding vectors representing commit messages and overall code changes extracted by the existing DeepJIT approach to train a model for predicting whether a given commit is defective. Although CC2Vec has been shown to be the state of the art for JIT defect prediction, it was only evaluated on a limited dataset and not compared with all representative baselines. Therefore, to further investigate the efficacy and limitations of CC2Vec, this paper performs an extensive study of CC2Vec on a large-scale dataset with over 310,370 changes (8.3 X larger than the original CC2Vec dataset). More specifically, we also empirically compare CC2Vec against DeepJIT and representative traditional JIT defect prediction techniques. The experimental results show that CC2Vec cannot consistently outperform DeepJIT, and neither of them can consistently outperform traditional JIT defect prediction. We also investigate the impact of individual traditional defect prediction features and find that the added-line-number feature outperforms other traditional features. Inspired by this finding, we construct a simplistic JIT defect prediction approach which simply adopts the added-line-number feature with the logistic regression classifier. Surprisingly, such a simplistic approach can outperform CC2Vec and DeepJIT in defect prediction, and can be 81k X/120k X faster in training/testing. Furthermore, the paper also provides various practical guidelines for advancing JIT defect prediction in the near future.","PeriodicalId":188008,"journal":{"name":"Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123035773","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With the widespread use of cloud-native architecture, a growing number of web applications (apps) are built on microservices. At the same time, troubleshooting has become challenging owing to the high dynamics and complexity of anomaly propagation. Existing diagnostic methods rely heavily on monitoring metrics collected from the kernel side of microservice systems. Without a comprehensive monitoring infrastructure, application owners and even cloud operators cannot resort to these kernel-space solutions. This paper summarizes several insights from operating a top commercial cloud platform. Then, for the first time, we put forward the idea of user-space diagnosis of microservice kernel failures. To this end, we develop a crowdsourcing solution, DyCause, to resolve the asymmetric diagnostic information problem. DyCause is deployed on the application side in a distributed manner. Through lightweight API log sharing, apps collect the operational status of kernel services collaboratively and initiate diagnosis on demand. Deploying DyCause is fast and lightweight, as it imposes no architectural or functional requirements on the kernel. To reveal more accurate correlations from asymmetric diagnostic information, we design a novel statistical algorithm that can efficiently discover the time-varying causalities between services. This algorithm also helps us build the temporal order of the anomaly propagation. Therefore, by using DyCause, we can obtain more in-depth and interpretable diagnostic clues with limited indicators. We apply and evaluate DyCause on both a simulated test-bed and a real-world cloud system. Experimental results verify that DyCause, running in user space, outperforms several state-of-the-art kernel-space algorithms in accuracy. In addition, DyCause shows clear advantages in algorithmic efficiency and data sensitivity: it produces significantly better results than other baselines when analyzing far fewer or sparser metrics. To conclude, DyCause is faster to act, deeper in analysis, and easier to deploy.
{"title":"Faster, deeper, easier: crowdsourcing diagnosis of microservice kernel failure from user space","authors":"Yicheng Pan, Meng Ma, Xinrui Jiang, Ping Wang","doi":"10.1145/3460319.3464805","DOIUrl":"https://doi.org/10.1145/3460319.3464805","url":null,"abstract":"With the widespread use of cloud-native architecture, increasing web applications (apps) choose to build on microservices. Simultaneously, troubleshooting becomes full of challenges owing to the high dynamics and complexity of anomaly propagation. Existing diagnostic methods rely heavily on monitoring metrics collected from the kernel side of microservice systems. Without a comprehensive monitoring infrastructure, application owners and even cloud operators cannot resort to these kernel-space solutions. This paper summarizes several insights on operating a top commercial cloud platform. Then, for the first time, we put forward the idea of user-space diagnosis for microservice kernel failures. To this end, we develop a crowdsourcing solution - DyCause, to resolve the asymmetric diagnostic information problem. DyCause deploys on the application side in a distributed manner. Through lightweight API log sharing, apps collect the operational status of kernel services collaboratively and initiate diagnosis on demand. Deploying DyCause is fast and lightweight as we do not have any architectural and functional requirements for the kernel. To reveal more accurate correlations from asymmetric diagnostic information, we design a novel statistical algorithm that can efficiently discover the time-varying causalities between services. This algorithm also helps us build the temporal order of the anomaly propagation. Therefore, by using DyCause, we can obtain more in-depth and interpretable diagnostic clues with limited indicators. We apply and evaluate DyCause on both a simulated test-bed and a real-world cloud system. Experimental results verify that DyCause running in the user-space outperforms several state-of-the-art algorithms running in the kernel on accuracy. Besides, DyCause shows superior advantages in terms of algorithmic efficiency and data sensitivity. Simply put, DyCause produces a significantly better result than other baselines when analyzing much fewer or sparser metrics. To conclude, DyCause is faster to act, deeper in analysis, and easier to deploy.","PeriodicalId":188008,"journal":{"name":"Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124874793","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Andrew Habib, Avraham Shinnar, Martin Hirzel, Michael Pradel
JSON is a data format used pervasively in web APIs, cloud computing, NoSQL databases, and increasingly also machine learning. To ensure that JSON data is compatible with an application, one can define a JSON schema and use a validator to check data against the schema. However, because validation can happen only once concrete data occurs during an execution, it may detect data compatibility bugs too late or not at all. Examples include evolving the schema for a web API, which may unexpectedly break client applications, or accidentally running a machine learning pipeline on incorrect data. This paper presents a novel way of detecting a class of data compatibility bugs via JSON subschema checking. Subschema checks find bugs before concrete JSON data is available and across all possible data specified by a schema. For example, one can check if evolving a schema would break API clients or if two components of a machine learning pipeline have incompatible expectations about data. Deciding whether one JSON schema is a subschema of another is non-trivial because the JSON Schema specification language is rich. Our key insight to address this challenge is to first reduce the richness of schemas by canonicalizing and simplifying them, and to then reason about the subschema question on simpler schema fragments using type-specific checkers. We apply our subschema checker to thousands of real-world schemas from different domains. In all experiments, the approach is correct whenever it gives an answer (100% precision and correctness), which is the case for most schema pairs (93.5% recall), clearly outperforming the state-of-the-art tool. Moreover, the approach reveals 43 previously unknown bugs in popular software, most of which have already been fixed, showing that JSON subschema checking helps find data compatibility bugs early.
{"title":"Finding data compatibility bugs with JSON subschema checking","authors":"Andrew Habib, Avraham Shinnar, Martin Hirzel, Michael Pradel","doi":"10.1145/3460319.3464796","DOIUrl":"https://doi.org/10.1145/3460319.3464796","url":null,"abstract":"JSON is a data format used pervasively in web APIs, cloud computing, NoSQL databases, and increasingly also machine learning. To ensure that JSON data is compatible with an application, one can define a JSON schema and use a validator to check data against the schema. However, because validation can happen only once concrete data occurs during an execution, it may detect data compatibility bugs too late or not at all. Examples include evolving the schema for a web API, which may unexpectedly break client applications, or accidentally running a machine learning pipeline on incorrect data. This paper presents a novel way of detecting a class of data compatibility bugs via JSON subschema checking. Subschema checks find bugs before concrete JSON data is available and across all possible data specified by a schema. For example, one can check if evolving a schema would break API clients or if two components of a machine learning pipeline have incompatible expectations about data. Deciding whether one JSON schema is a subschema of another is non-trivial because the JSON Schema specification language is rich. Our key insight to address this challenge is to first reduce the richness of schemas by canonicalizing and simplifying them, and to then reason about the subschema question on simpler schema fragments using type-specific checkers. We apply our subschema checker to thousands of real-world schemas from different domains. In all experiments, the approach is correct whenever it gives an answer (100% precision and correctness), which is the case for most schema pairs (93.5% recall), clearly outperforming the state-of-the-art tool. Moreover, the approach reveals 43 previously unknown bugs in popular software, most of which have already been fixed, showing that JSON subschema checking helps finding data compatibility bugs early.","PeriodicalId":188008,"journal":{"name":"Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122328306","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Adrián Herrera, Hendra Gunadi, S. Magrath, Michael Norrish, Mathias Payer, Antony Lloyd Hosking
Mutation-based greybox fuzzing---unquestionably the most widely-used fuzzing technique---relies on a set of non-crashing seed inputs (a corpus) to bootstrap the bug-finding process. When evaluating a fuzzer, common approaches for constructing this corpus include: (i) using an empty file; (ii) using a single seed representative of the target's input format; or (iii) collecting a large number of seeds (e.g., by crawling the Internet). Little thought is given to how this seed choice affects the fuzzing process, and there is no consensus on which approach is best (or even if a best approach exists). To address this gap in knowledge, we systematically investigate and evaluate how seed selection affects a fuzzer's ability to find bugs in real-world software. This includes a systematic review of seed selection practices used in both evaluation and deployment contexts, and a large-scale empirical evaluation (over 33 CPU-years) of six seed selection approaches. These six seed selection approaches include three corpus minimization techniques (which select the smallest subset of seeds that trigger the same range of instrumentation data points as a full corpus). Our results demonstrate that fuzzing outcomes vary significantly depending on the initial seeds used to bootstrap the fuzzer, with minimized corpora outperforming singleton, empty, and large (in the order of thousands of files) seed sets. Consequently, we encourage seed selection to be foremost in mind when evaluating/deploying fuzzers, and recommend that (a) seed choice be carefully considered and explicitly documented, and (b) never to evaluate fuzzers with only a single seed.
{"title":"Seed selection for successful fuzzing","authors":"Adrián Herrera, Hendra Gunadi, S. Magrath, Michael Norrish, Mathias Payer, Antony Lloyd Hosking","doi":"10.1145/3460319.3464795","DOIUrl":"https://doi.org/10.1145/3460319.3464795","url":null,"abstract":"Mutation-based greybox fuzzing---unquestionably the most widely-used fuzzing technique---relies on a set of non-crashing seed inputs (a corpus) to bootstrap the bug-finding process. When evaluating a fuzzer, common approaches for constructing this corpus include: (i) using an empty file; (ii) using a single seed representative of the target's input format; or (iii) collecting a large number of seeds (e.g., by crawling the Internet). Little thought is given to how this seed choice affects the fuzzing process, and there is no consensus on which approach is best (or even if a best approach exists). To address this gap in knowledge, we systematically investigate and evaluate how seed selection affects a fuzzer's ability to find bugs in real-world software. This includes a systematic review of seed selection practices used in both evaluation and deployment contexts, and a large-scale empirical evaluation (over 33 CPU-years) of six seed selection approaches. These six seed selection approaches include three corpus minimization techniques (which select the smallest subset of seeds that trigger the same range of instrumentation data points as a full corpus). Our results demonstrate that fuzzing outcomes vary significantly depending on the initial seeds used to bootstrap the fuzzer, with minimized corpora outperforming singleton, empty, and large (in the order of thousands of files) seed sets. Consequently, we encourage seed selection to be foremost in mind when evaluating/deploying fuzzers, and recommend that (a) seed choice be carefully considered and explicitly documented, and (b) never to evaluate fuzzers with only a single seed.","PeriodicalId":188008,"journal":{"name":"Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis","volume":"179 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116158567","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ridwan Shariffdeen, Xiang Gao, Gregory J. Duck, Shin Hwei Tan, J. Lawall, Abhik Roychoudhury
Whenever a bug or vulnerability is detected in the Linux kernel, the kernel developers will endeavour to fix it by introducing a patch into the mainline version of the Linux kernel source tree. However, many users run older “stable” versions of Linux, meaning that the patch should also be “backported” to one or more of these older kernel versions. This process is error-prone and there is usually a long delay in publishing the backported patch. Based on an empirical study, we show that around 8% of all commits submitted to Linux mainline are backported to older versions, but often more than one month elapses before the backport is available. Hence, we propose a patch backporting technique that can automatically transfer patches from the mainline version of Linux into older stable versions. Our approach first synthesizes a partial transformation rule based on a Linux mainline patch. This rule can then be generalized by analysing the alignment between the mainline and target versions. The generalized rule is then applied to the target version to produce a backported patch. We have implemented our transformation technique in a tool called FixMorph and evaluated it on 350 Linux mainline patches. FixMorph correctly backports 75.1% of them. Compared to existing techniques, FixMorph improves both precision and recall in backporting patches. Apart from automating software maintenance tasks, patch backporting helps reduce the exposure to known security vulnerabilities in stable versions of the Linux kernel.
{"title":"Automated patch backporting in Linux (experience paper)","authors":"Ridwan Shariffdeen, Xiang Gao, Gregory J. Duck, Shin Hwei Tan, J. Lawall, Abhik Roychoudhury","doi":"10.1145/3460319.3464821","DOIUrl":"https://doi.org/10.1145/3460319.3464821","url":null,"abstract":"Whenever a bug or vulnerability is detected in the Linux kernel, the kernel developers will endeavour to fix it by introducing a patch into the mainline version of the Linux kernel source tree. However, many users run older “stable” versions of Linux, meaning that the patch should also be “backported” to one or more of these older kernel versions. This process is error-prone and there is usually along delay in publishing the backported patch. Based on an empirical study, we show that around 8% of all commits submitted to Linux mainline are backported to older versions,but often more than one month elapses before the backport is available. Hence, we propose a patch backporting technique that can automatically transfer patches from the mainline version of Linux into older stable versions. Our approach first synthesizes a partial transformation rule based on a Linux mainline patch. This rule can then be generalized by analysing the alignment between the mainline and target versions. The generalized rule is then applied to the target version to produce a backported patch. We have implemented our transformation technique in a tool called FixMorph and evaluated it on 350 Linux mainline patches. FixMorph correctly backports 75.1% of them. Compared to existing techniques, FixMorph improves both the precision and recall in backporting patches. Apart from automation of software maintenance tasks, patch backporting helps in reducing the exposure to known security vulnerabilities in stable versions of the Linux kernel.","PeriodicalId":188008,"journal":{"name":"Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117186833","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Continuous integration advocates running the test suite of a project frequently, e.g., for every code change committed to a shared repository. This process imposes a high computational cost and sometimes also a high human cost, e.g., when developers must wait for the test suite to pass before a change appears in the main branch of the shared repository. However, only 4% of all test suite invocations turn a previously passing test suite into a failing test suite. The question arises whether running the test suite for each code change is really necessary. This paper presents continuous test suite failure prediction, which reduces the cost of continuous integration by predicting whether a particular code change should trigger the test suite at all. The core of the approach is a machine learning model based on features of the code change, the test suite, and the development history. We also present a theoretical cost model that describes when continuous test suite failure prediction is worthwhile. Evaluating the idea with 15k test suite runs from 242 open-source projects shows that the approach is effective at predicting whether running the test suite is likely to reveal a test failure. Moreover, we find that our approach improves the AUC over baselines that use features proposed for just-in-time defect prediction and test case failure prediction by 13.9% and 2.9%, respectively. Overall, continuous test suite failure prediction can significantly reduce the cost of continuous integration.
{"title":"Continuous test suite failure prediction","authors":"Cong Pan, Michael Pradel","doi":"10.1145/3460319.3464840","DOIUrl":"https://doi.org/10.1145/3460319.3464840","url":null,"abstract":"Continuous integration advocates to run the test suite of a project frequently, e.g., for every code change committed to a shared repository. This process imposes a high computational cost and sometimes also a high human cost, e.g., when developers must wait for the test suite to pass before a change appears in the main branch of the shared repository. However, only 4% of all test suite invocations turn a previously passing test suite into a failing test suite. The question arises whether running the test suite for each code change is really necessary. This paper presents continuous test suite failure prediction, which reduces the cost of continuous integration by predicting whether a particular code change should trigger the test suite at all. The core of the approach is a machine learning model based on features of the code change, the test suite, and the development history. We also present a theoretical cost model that describes when continuous test suite failure prediction is worthwhile. Evaluating the idea with 15k test suite runs from 242 open-source projects shows that the approach is effective at predicting whether running the test suite is likely to reveal a test failure. Moreover, we find that our approach improves the AUC over baselines that use features proposed for just-in-time defect prediction and test case failure prediction by 13.9% and 2.9%, respectively. Overall, continuous test suite failure prediction can significantly reduce the cost of continuous integration.","PeriodicalId":188008,"journal":{"name":"Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129218135","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Meng Ren, Zijing Yin, Fuchen Ma, Zhenyang Xu, Yu Jiang, Chengnian Sun, Huizhong Li, Yan Cai
Security of smart contracts has attracted increasing attention in recent years. Many researchers have devoted themselves to devising testing tools for vulnerability detection. Each published tool has demonstrated its effectiveness through a series of evaluations in its own experimental scenarios. However, inconsistent evaluation settings, such as different datasets or performance metrics, may result in biased conclusions. In this paper, based on an empirical evaluation of widely used smart contract testing tools, we propose a unified standard to eliminate bias in the assessment process. First, we collect 46,186 source-available smart contracts from four influential organizations. This comprehensive dataset is open to the public and covers different code characteristics, vulnerability patterns, and application scenarios. Then we propose a 4-step evaluation process and summarize the differences among related work at each of these steps. We use nine representative tools to carry out extensive experiments. The results demonstrate that different choices of experimental settings can significantly affect tool performance and lead to misleading or even opposite conclusions. Finally, we identify several general problems of existing testing tools and propose possible directions for further improvement.
{"title":"Empirical evaluation of smart contract testing: what is the best choice?","authors":"Meng Ren, Zijing Yin, Fuchen Ma, Zhenyang Xu, Yu Jiang, Chengnian Sun, Huizhong Li, Yan Cai","doi":"10.1145/3460319.3464837","DOIUrl":"https://doi.org/10.1145/3460319.3464837","url":null,"abstract":"Security of smart contracts has attracted increasing attention in recent years. Many researchers have devoted themselves to devising testing tools for vulnerability detection. Each published tool has demonstrated its effectiveness through a series of evaluations on their own experimental scenarios. However, the inconsistency of evaluation settings such as different data sets or performance metrics, may result in biased conclusion. In this paper, based on an empirical evaluation of widely used smart contract testing tools, we propose a unified standard to eliminate the bias in the assessment process. First, we collect 46,186 source-available smart contracts from four influential organizations. This comprehensive dataset is open to the public and involves different code characteristics, vulnerability patterns and application scenarios. Then we propose a 4-step evaluation process and summarize the difference among relevant work in these steps. We use nine representative tools to carry out extensive experiments. The results demonstrate that different choices of experimental settings could significantly affect tool performance and lead to misleading or even opposite conclusions. Finally, we generalize some problems of existing testing tools, and propose some possible directions for further improvement.","PeriodicalId":188008,"journal":{"name":"Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131343041","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Echidna is a widely used fuzzer for Ethereum Virtual Machine (EVM) compatible blockchain smart contracts that generates transaction sequences of calls to smart contracts. While Echidna is an essentially single-threaded tool, it is possible for multiple Echidna processes to communicate by use of a shared transaction sequence corpus. Echidna provides a very large variety of configuration options, since each smart contract may be best tested by a non-default configuration, and different faults or coverage targets within a single contract may also have differing ideal configurations. This paper presents echidna-parade, a tool that provides push-button multicore fuzzing, using Echidna as the underlying fuzzing engine, and automatically applies sophisticated diversification of configurations. Even without using multiple cores, echidna-parade can improve the effectiveness of fuzzing with Echidna, due to the advantages provided by multiple types of test configuration diversity. Using echidna-parade with multiple cores can produce significantly better results than Echidna, in less time.
{"title":"echidna-parade: a tool for diverse multicore smart contract fuzzing","authors":"Alex Groce, Gustavo Grieco","doi":"10.1145/3460319.3469076","DOIUrl":"https://doi.org/10.1145/3460319.3469076","url":null,"abstract":"Echidna is a widely used fuzzer for Ethereum Virtual Machine (EVM) compatible blockchain smart contracts that generates transaction sequences of calls to smart contracts. While Echidna is an essentially single-threaded tool, it is possible for multiple Echidna processes to communicate by use of a shared transaction sequence corpus. Echidna provides a very large variety of configuration options, since each smart contract may be best-tested by a non-default configuration, and different faults or coverage targets within a single contract may also have differing ideal configurations. This paper presents echidna-parade, a tool that provides pushbutton multicore fuzzing using Echidna as an underlying fuzzing engine, and automatically provides sophisticated diversification of configurations. Even without using multiple cores, echidna-parade can improve the effectiveness of fuzzing with Echidna, due to the advantages provided by multiple types of test configuration diversity. Using echidna-parade with multiple cores can produce significantly better results than Echidna, in less time.","PeriodicalId":188008,"journal":{"name":"Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115750682","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Benjamin Barslev Nielsen, Martin Toldam Torp, Anders Møller
Most of the code in typical Node.js applications comes from third-party libraries that consist of a large number of interdependent modules. Because of the dynamic features of JavaScript, it is difficult to obtain detailed information about the module dependencies, which is vital for reasoning about the potential consequences of security vulnerabilities in libraries, and for many other software development tasks. The underlying challenge is how to construct precise call graphs that capture the connectivity between functions in the modules. In this work we present a novel approach to call graph construction for Node.js applications that is modular, taking into account the modular structure of Node.js applications, and sufficiently accurate and efficient to be practically useful. We demonstrate experimentally that the constructed call graphs are useful for security scanning, reducing the number of false positives by 81% compared to npm audit and with zero false negatives. Compared to js-callgraph, the call graph construction is significantly more accurate and efficient. The experiments also show that the analysis time is reduced substantially when reusing modular call graphs.
{"title":"Modular call graph construction for security scanning of Node.js applications","authors":"Benjamin Barslev Nielsen, Martin Toldam Torp, Anders Møller","doi":"10.1145/3460319.3464836","DOIUrl":"https://doi.org/10.1145/3460319.3464836","url":null,"abstract":"Most of the code in typical Node.js applications comes from third-party libraries that consist of a large number of interdependent modules. Because of the dynamic features of JavaScript, it is difficult to obtain detailed information about the module dependencies, which is vital for reasoning about the potential consequences of security vulnerabilities in libraries, and for many other software development tasks. The underlying challenge is how to construct precise call graphs that capture the connectivity between functions in the modules. In this work we present a novel approach to call graph construction for Node.js applications that is modular, taking into account the modular structure of Node.js applications, and sufficiently accurate and efficient to be practically useful. We demonstrate experimentally that the constructed call graphs are useful for security scanning, reducing the number of false positives by 81% compared to npm audit and with zero false negatives. Compared to js-callgraph, the call graph construction is significantly more accurate and efficient. The experiments also show that the analysis time is reduced substantially when reusing modular call graphs.","PeriodicalId":188008,"journal":{"name":"Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122159767","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ming Yuan, Yeseop Lee, Chao Zhang, Yun Li, Yan Cai, Bodong Zhao
A growing number of bugs have been reported by vulnerability discovery solutions. Among them, some bugs are hard to diagnose or reproduce, including data race bugs caused by thread interleavings. Few solutions are able to address this issue well, due to the huge space of interleavings to explore. Worse, in security analysis scenarios, analysts usually have no access to the source code of target programs and have trouble comprehending them. In this paper, we propose a general solution, RAProducer, to efficiently diagnose and reproduce data race bugs for both user-land binary programs and kernels without source code. The efficiency of RAProducer is achieved by analyzing the execution trace of a given PoC (proof-of-concept) sample to recognize race- and bug-related elements (including locks and shared variables), which greatly facilitates narrowing down the huge search space of data race spots and thread interleavings. We have implemented a prototype of RAProducer and evaluated it on 7 kernel and 10 user-land data race bugs. Evaluation results show that RAProducer is effective at reproducing all these bugs. More importantly, it enabled us to diagnose 2 additional real-world bugs that had been left unconfirmed for a long time. It is also efficient, as it reduces the candidate data race spots of each bug to a small set and greatly narrows down the thread interleavings. RAProducer is also more effective at reproducing real-world data race bugs than other state-of-the-art solutions.
{"title":"RAProducer: efficiently diagnose and reproduce data race bugs for binaries via trace analysis","authors":"Ming Yuan, Yeseop Lee, Chao Zhang, Yun Li, Yan Cai, Bodong Zhao","doi":"10.1145/3460319.3464831","DOIUrl":"https://doi.org/10.1145/3460319.3464831","url":null,"abstract":"A growing number of bugs have been reported by vulnerability discovery solutions. Among them, some bugs are hard to diagnose or reproduce, including data race bugs caused by thread interleavings. Few solutions are able to well address this issue, due to the huge space of interleavings to explore. What’s worse, in security analysis scenarios, analysts usually have no access to the source code of target programs and have troubles in comprehending them. In this paper, we propose a general solution RAProducer to efficiently diagnose and reproduce data race bugs, for both user-land binary programs and kernels without source code. The efficiency of RAProducer is achieved by analyzing the execution trace of the given PoC (proof-of-concept) sample to recognize race- and bug-related elements (including locks and shared variables), which greatly facilitate narrowing down the huge search space of data race spots and thread interleavings. We have implemented a prototype of RAProducer and evaluated it on 7 kernel and 10 user-land data race bugs. Evaluation results showed that, RAProducer is effective at reproducing all these bugs. More importantly, it enables us to diagnose 2 extra real world bugs which are left unconfirmed for a long time. It is also efficient as it reduces candidate data race spots of each bug to a small set, and narrows down the thread interleaving greatly.RAProducer is also more effective in reproducing real-world data race bugs than other state-of-the-art solutions.","PeriodicalId":188008,"journal":{"name":"Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132462039","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}