On the use of mutation analysis for evaluating student test suite quality
J. Perretta, A. DeOrio, Arjun Guha, Jonathan Bell
DOI: 10.1145/3533767.3534217

A common practice in computer science courses is to evaluate student-written test suites against either a set of manually-seeded faults (handwritten by an instructor) or against all other student-written implementations (“all-pairs” grading). However, manually seeding faults is a time-consuming and potentially error-prone process, and the all-pairs approach requires significant manual and computational effort to apply fairly and accurately. Mutation analysis, which automatically seeds potential faults in an implementation, is a possible alternative to these test suite evaluation approaches. Although there is evidence in the literature that mutants are a valid substitute for real faults in large open-source software projects, it is unclear whether mutants are representative of the kinds of faults that students make. If mutants are a valid substitute for faults found in student-written code, and if mutant detection is correlated with manually-seeded fault detection and faulty student implementation detection, then instructors can instead evaluate student test suites using mutants generated by open-source mutation analysis tools. Using a dataset of 2,711 student assignment submissions, we empirically evaluate whether mutation score is a good proxy for manually-seeded fault detection rate and faulty student implementation detection rate. Our results show a strong correlation between mutation score and manually-seeded fault detection rate, and a moderately strong correlation between mutation score and faulty student implementation detection. We identify a handful of faults in student implementations that, to be coupled to a mutant, would require new or stronger mutation operators, or applying mutation operators to an implementation with a different structure than the instructor-written implementation. We also find that this correlation is limited by the fact that faults are not distributed evenly throughout student code, a known drawback of all-pairs grading. Our results suggest that mutants produced by open-source mutation analysis tools are of equal or higher quality than manually-seeded faults and are a reasonably good stand-in for real faults in student implementations. Our findings have implications for software testing researchers, educators, and tool builders alike.
Patch correctness assessment in automated program repair based on the impact of patches on production and test code
Ali Ghanbari, Andrian Marcus
DOI: 10.1145/3533767.3534368

Test-based generate-and-validate automated program repair (APR) systems often generate many patches that pass the test suite without fixing the bug. The generated patches must be manually inspected by the developers, so previous research proposed various techniques for automatic correctness assessment of APR-generated patches. Among them, dynamic patch correctness assessment techniques rely on the assumption that, when running the originally passing test cases, the correct patches will not alter the program behavior in a significant way, e.g., removing the code implementing correct functionality of the program. In this paper, we propose and evaluate a novel technique, named Shibboleth, for automatic correctness assessment of the patches generated by test-based generate-and-validate APR systems. Unlike existing works, the impact of the patches is captured along three complementary facets, allowing more effective patch correctness assessment. Specifically, we measure the impact of patches on both production code (via syntactic and semantic similarity) and test code (via code coverage of passing tests) to separate the patches that result in similar programs and that do not delete desired program elements. Shibboleth assesses the correctness of patches via both ranking and classification. We evaluated Shibboleth on 1,871 patches, generated by 29 Java-based APR systems for Defects4J programs. The technique outperforms state-of-the-art ranking and classification techniques. Specifically, in our ranking data set, in 43% (66%) of the cases, Shibboleth ranks the correct patch in top-1 (top-2) positions, and in classification mode applied on our classification data set, it achieves an accuracy and F1-score of 0.887 and 0.852, respectively.
On the use of evaluation measures for defect prediction studies
Rebecca Moussa, Federica Sarro
DOI: 10.1145/3533767.3534405

Software defect prediction research has adopted various evaluation measures to assess the performance of prediction models. In this paper, we stress the importance of choosing appropriate measures in order to correctly assess the strengths and weaknesses of a given defect prediction model, especially given that most defect prediction tasks suffer from data imbalance. Investigating 111 previous studies published between 2010 and 2020, we found that over half either use only one evaluation measure, which alone cannot express all the characteristics of model performance in the presence of imbalanced data, or use a set of binary measures that are prone to bias when assessing models trained with imbalanced data. We also unveil the magnitude of the impact of assessing popular defect prediction models with several evaluation measures based, for the first time, on both statistical significance tests and effect size analyses. Our results reveal that the evaluation measures produce a different ranking of the classification models in 82% and 85% of the cases studied, according to the Wilcoxon statistical significance test and the Â12 effect size, respectively. Further, we observe a very high rank disruption (between 64% and 92% on average) for each of the measures investigated. This signifies that, in the majority of cases, a prediction technique believed to be better than others under a given evaluation measure becomes worse under a different one. We conclude by providing recommendations for the selection of appropriate evaluation measures based on factors specific to the problem at hand, such as the class distribution of the training data and the way in which the model has been built and will be used. Moreover, we recommend including in the set of evaluation measures at least one able to capture the full picture of the confusion matrix, such as MCC. This will enable researchers to assess whether proposals made in previous work can be applied for purposes different from the ones they were originally intended for. In addition, we recommend reporting, whenever possible, the raw confusion matrix, so that other researchers can compute any measure of interest, making it feasible to draw meaningful observations across different studies.
Human-in-the-loop oracle learning for semantic bugs in string processing programs
Charaka Geethal Kapugama, Van-Thuan Pham, A. Aleti, Marcel Böhme
DOI: 10.1145/3533767.3534406

How can we automatically repair semantic bugs in string-processing programs? A semantic bug is an unexpected program state: The program does not crash (which can be easily detected). Instead, the program processes the input incorrectly. It produces an output which users identify as unexpected. We envision a fully automated debugging process for semantic bugs where a user reports the unexpected behavior for a given input and the machine negotiates the condition under which the program fails. During the negotiation, the machine learns to predict the user's response and in this process learns an automated oracle for semantic bugs. In this paper, we introduce Grammar2Fix, an automated oracle learning and debugging technique for string-processing programs even when the input format is unknown. Grammar2Fix represents the oracle as a regular grammar which is iteratively improved by systematic queries to the user for other inputs that are likely failing. Grammar2Fix implements several heuristics to maximize the oracle quality under a minimal query budget. In our experiments with three widely used repair benchmark sets, Grammar2Fix predicts passing inputs as passing and failing inputs as failing with more than 96% precision and recall, using a median of 42 queries to the user.
TeLL: log level suggestions via modeling multi-level code block information
Jiahao Liu, Jun Zeng, Xiang Wang, Kaihang Ji, Zhenkai Liang
DOI: 10.1145/3533767.3534379

Developers insert logging statements into source code to monitor system execution, which forms the basis for software debugging and maintenance. For distinguishing diverse runtime information, each software log is assigned a separate verbosity level (e.g., trace and error). However, choosing an appropriate verbosity level is a challenging and error-prone task due to the lack of specifications for log level usage. Prior solutions aim to suggest log levels based on the code block in which a logging statement resides (i.e., intra-block features). Such suggestions, however, do not consider information from surrounding blocks (i.e., inter-block features), which also plays an important role in revealing logging characteristics. To address this issue, we combine multiple levels of code block information (i.e., intra-block and inter-block features) into a joint graph structure called Flow of Abstract Syntax Tree (FAST). To explicitly exploit multi-level block features, we design a new neural architecture, Hierarchical Block Graph Network (HBGN), on the FAST. In particular, it leverages graph neural networks to encode both the intra-block and inter-block features into code block representations and guide log level suggestions. We implement a prototype system, TeLL, and evaluate its effectiveness on nine large-scale software systems. Experimental results showcase TeLL's advantage in predicting log levels over the state-of-the-art approaches.
Program vulnerability repair via inductive inference
Yuntong Zhang, Xiang Gao, Gregory J. Duck, Abhik Roychoudhury
DOI: 10.1145/3533767.3534387

Program vulnerabilities, even when detected and reported, are not fixed immediately. The time lag between the reporting and fixing of a vulnerability causes open-source software systems to suffer from significant exposure to possible attacks. In this paper, we propose a counter-example guided inductive inference procedure over program states to define likely invariants at possible fix locations. The likely invariants are constructed via mutation over states at the fix location, which turns out to be more effective for inductive property inference than the usual greybox fuzzing over program inputs. Once such likely invariants, which we call patch invariants, are identified, we can use them to construct patches via simple patch templates. Our work assumes that only one failing input (representing the exploit) is available to start the repair process. Experiments on the VulnLoc dataset of 39 vulnerabilities, which has been curated in previous work on vulnerability repair, show the effectiveness of our repair procedure. Compared to proposed approaches for vulnerability repair such as CPR and SenX, which are based on concolic and symbolic execution respectively, we can repair significantly more vulnerabilities. Our results show the potential of program repair via inductive constraint inference, as opposed to generating repair constraints via deductive/symbolic analysis of a given test suite.
Almost correct invariants: synthesizing inductive invariants by fuzzing proofs
S. Lahiri, Subhajit Roy
DOI: 10.1145/3533767.3534381

Real-life programs contain multiple operations whose semantics are unavailable to verification engines, such as third-party library calls, inline assembly and SIMD instructions, special compiler-provided primitives, and queries to uninterpretable machine learning models. Even with the exceptional success story of program verification, synthesis of inductive invariants for such "open" programs has remained a challenge. Currently, this problem is handled by manually "closing" the program, i.e., by providing hand-written stubs that attempt to capture the behavior of the unmodelled operations; writing stubs is not only difficult and tedious, but the stubs are often incorrect, raising serious questions about the whole endeavor. In this work, we propose Almost Correct Invariants as an automated strategy for synthesizing inductive invariants for such "open" programs. We adopt an active learning strategy where a data-driven learner proposes candidate invariants. In contrast to prior work that attempts to verify invariants, we attempt to falsify them: we reduce the falsification problem to a set of reachability checks on non-deterministic programs, and we ride on the success of modern fuzzers to answer these reachability queries. Our tool, Achar, automatically synthesizes inductive invariants that are sufficient to prove the correctness of the target programs. We compare Achar with a state-of-the-art invariant synthesis tool that employs theorem proving on formulae built over the program source. Though Achar is without strong soundness guarantees, our experiments show that even when we provide almost no access to the program source, Achar outperforms the state-of-the-art invariant generator that has complete access to the source. We also evaluate Achar on programs that current invariant synthesis engines cannot handle, namely programs that invoke external library calls, inline assembly, and queries to convolutional neural networks; Achar successfully infers the necessary inductive invariants within a reasonable time.
A large-scale empirical analysis of the vulnerabilities introduced by third-party components in IoT firmware
Binbin Zhao, S. Ji, Jiacheng Xu, Yuan Tian, Qiuyang Wei, Qinying Wang, Chenyang Lyu, Xuhong Zhang, Changting Lin, Jingzheng Wu, R. Beyah
DOI: 10.1145/3533767.3534366

As the core of IoT devices, firmware is undoubtedly vital. Currently, the development of IoT firmware heavily depends on third-party components (TPCs), which significantly improves development efficiency and reduces cost. Nevertheless, TPCs are not secure, and vulnerabilities in TPCs in turn affect the security of IoT firmware. Existing work pays little attention to the vulnerabilities caused by TPCs, and we still lack a comprehensive understanding of the security impact of TPC vulnerabilities on firmware. To fill this knowledge gap, we design and implement FirmSec, which leverages syntactical features and control-flow graph features to detect TPCs in firmware at the version level, and then recognizes the corresponding vulnerabilities. Based on FirmSec, we present the first large-scale analysis of the usage of TPCs and the corresponding vulnerabilities in firmware. More specifically, we analyze 34,136 firmware images, including 11,086 publicly accessible firmware images and 23,050 private firmware images from TSmart. We successfully detect 584 TPCs and identify 128,757 vulnerabilities caused by 429 CVEs. Our in-depth analysis reveals the diversity of security issues across different kinds of firmware from various vendors, and discovers that some well-known vulnerabilities are still deeply rooted in many firmware images. We also find that the TPCs used in firmware have fallen behind by five years on average. In addition, we explore the geographical distribution of vulnerable devices and confirm that the security situation of devices in several regions, e.g., South Korea and China, is more severe than in other regions. Further analysis shows that 2,478 commercial firmware images have potentially violated GPL/AGPL licensing terms.
Test mimicry to assess the exploitability of library vulnerabilities
Hong Jin Kang, Truong-Giang Nguyen, Bach Le, C. Pasareanu, D. Lo
DOI: 10.1145/3533767.3534398

Modern software engineering projects often depend on open-source software libraries, rendering them vulnerable to potential security issues in these libraries. Developers of client projects have to stay alert to security threats in their software dependencies. While there are existing tools that allow developers to assess whether a library vulnerability is reachable from a project, they face limitations. Call graph-only approaches may produce false alarms, as the client project may not use the vulnerable code in a way that triggers the vulnerability, while test generation-based approaches face difficulties in overcoming the intrinsic complexity of exploiting a vulnerability, where extensive domain knowledge may be required to produce a vulnerability-triggering input. In this work, we propose a new framework named Test Mimicry that constructs a test case for a client project that exploits a vulnerability in its library dependencies. Given a test case in a software library that reveals a vulnerability, our approach captures the program state associated with the vulnerability. Then, it guides test generation to construct a test case for the client program that invokes the library such that it reaches the same program state as the library's test case. Our framework is implemented in a tool, TRANSFER, which uses search-based test generation. Based on the library's test case, we produce search goals that represent the program state triggering the vulnerability. Our empirical evaluation on 22 real library vulnerabilities and 64 client programs shows that TRANSFER outperforms an existing approach, SIEGE; TRANSFER generates 4x more test cases that demonstrate the exploitability of vulnerabilities from client projects than SIEGE.
Park: accelerating smart contract vulnerability detection via parallel-fork symbolic execution
Peilin Zheng, Zibin Zheng, Xiapu Luo
DOI: 10.1145/3533767.3534395

Symbolic execution has been widely used to detect vulnerabilities in smart contracts. Unfortunately, as reported, existing symbolic tools take too much time, since they need to execute all paths to detect vulnerabilities. Thus, their accuracy is limited by the available analysis time. To tackle this problem, in this paper we propose Park, the first general framework of parallel-fork symbolic execution for smart contracts. The main idea is to use multiple processes during symbolic execution, leveraging multiple CPU cores to enhance efficiency. First, we propose a fork-operation-based dynamic forking algorithm to achieve parallel symbolic contract execution. Second, to address the SMT performance loss problem in parallelization, we propose an adaptive process restriction and adjustment algorithm. Third, we design a shared-memory-based global variable reconstruction method to collect and rebuild the global variables from different processes. We implement Park as a plug-in and apply it to two popular symbolic execution tools for smart contracts: Oyente and Mythril. Experimental results with third-party datasets show that Park-Oyente and Park-Mythril provide up to 6.84x and 7.06x speedup compared to the original tools, respectively.