
Latest publications: Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis

The raise of machine learning hyperparameter constraints in Python code
Ingkarat Rak-amnouykit, Ana L. Milanova, Guillaume Baudart, Martin Hirzel, Julian Dolby
Machine-learning operators often have correctness constraints that cut across multiple hyperparameters and/or data. Violating these constraints causes the operator to raise runtime exceptions, but those are usually documented only informally or not at all. This paper presents the first interprocedural weakest-precondition analysis for Python to extract hyperparameter constraints. The analysis is mostly static, but to make it tractable for typical Python idioms in machine-learning libraries, it selectively switches to the concrete domain for some cases. This paper demonstrates the analysis by extracting hyperparameter constraints for 181 operators from a total of 8 ML libraries, where it achieved high precision and recall and found real bugs. Our technique advances static analysis for Python and is a step towards safer and more robust machine learning.
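A minimal sketch of the kind of cross-hyperparameter constraint the analysis extracts. The operator and the exact constraint below are hypothetical illustrations (loosely modeled on scikit-learn-style runtime checks), not output of the paper's tool.

```python
def check_pca_constraints(n_components, n_samples, n_features, svd_solver):
    """Weakest-precondition-style predicate: returns True iff a PCA-like
    operator would not raise at runtime (hypothetical, simplified)."""
    if svd_solver == "arpack":
        # arpack-style solvers require strictly fewer components
        # than min(n_samples, n_features)
        return 0 < n_components < min(n_samples, n_features)
    # other solvers allow n_components up to min(n_samples, n_features)
    return 0 < n_components <= min(n_samples, n_features)

# A violation cuts across two hyperparameters and the data shape at once:
assert check_pca_constraints(3, 100, 5, "randomized")
assert not check_pca_constraints(5, 100, 5, "arpack")
```

The point of extracting such predicates statically is that callers can validate a configuration before the operator raises deep inside the library.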
{"title":"The raise of machine learning hyperparameter constraints in Python code","authors":"Ingkarat Rak-amnouykit, Ana L. Milanova, Guillaume Baudart, Martin Hirzel, Julian Dolby","doi":"10.1145/3533767.3534400","DOIUrl":"https://doi.org/10.1145/3533767.3534400","url":null,"abstract":"Machine-learning operators often have correctness constraints that cut across multiple hyperparameters and/or data. Violating these constraints causes the operator to raise runtime exceptions, but those are usually documented only informally or not at all. This paper presents the first interprocedural weakest-precondition analysis for Python to extract hyperparameter constraints. The analysis is mostly static, but to make it tractable for typical Python idioms in machine-learning libraries, it selectively switches to the concrete domain for some cases. This paper demonstrates the analysis by extracting hyperparameter constraints for 181 operators from a total of 8 ML libraries, where it achieved high precision and recall and found real bugs. Our technique advances static analysis for Python and is a step towards safer and more robust machine learning.","PeriodicalId":412271,"journal":{"name":"Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116721856","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
iFixDataloss: a tool for detecting and fixing data loss issues in Android apps
Wunan Guo, Zhen Dong, Liwei Shen, Wei Tian, Ting Su, Xin Peng
Android apps are event-driven, and their execution is often interrupted by external events. This interruption can cause data loss issues that annoy users. For instance, when the screen is rotated, the current app page is destroyed and recreated; if the app state is improperly preserved, user data is lost. In this work, we present iFixDataloss, a tool that automatically detects and fixes data loss issues in Android apps. To achieve this, we identify scenarios in which data loss issues may occur by analyzing the Android life cycle, develop strategies to reveal data loss issues, and design patch templates to fix them. Our experiments on 66 Android apps show that iFixDataloss detected 374 data loss issues (284 of them previously unknown) and successfully generated patches for 188 of the 374 issues. Out of 20 submitted patches, 16 have been accepted by developers. Compared with state-of-the-art techniques, iFixDataloss performed significantly better in terms of both the number of detected data loss issues and the quality of the generated patches. Video Link: https://www.youtube.com/watch?v=MAPsCo-dRKs Github Link: https://github.com/iFixDataLoss/iFixDataloss22
{"title":"iFixDataloss: a tool for detecting and fixing data loss issues in Android apps","authors":"Wunan Guo, Zhen Dong, Liwei Shen, Wei Tian, Ting Su, Xin Peng","doi":"10.1145/3533767.3543297","DOIUrl":"https://doi.org/10.1145/3533767.3543297","url":null,"abstract":"Android apps are event-driven, and their execution is often interrupted by external events. This interruption can cause data loss issues that annoy users. For instance, when the screen is rotated, the current app page will be destroyed and recreated. If the app state is improperly preserved, user data will be lost. In this work, we present a tool iFixDataloss that automatically detects and fixes data loss issues in Android apps. To achieve this, we identify scenarios in which data loss issues may occur by analyzing the Android life cycle, developing strategies to reveal data loss issues, and designing patch templates to fix them. Our experiments on 66 Android apps show iFixDataloss detected 374 data loss issues (284 of them were previously unknown) and successfully generated patches for 188 of the 374 issues. Out of 20 submitted patches, 16 have been accepted by developers. In comparison with state-of-the-art techniques, iFixDataloss performed significantly better in terms of the number of detected data loss issues and the quality of generated patches. 
Video Link: https://www.youtube.com/watch?v=MAPsCo-dRKs Github Link: https://github.com/iFixDataLoss/iFixDataloss22","PeriodicalId":412271,"journal":{"name":"Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126994153","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
LiRTest: augmenting LiDAR point clouds for automated testing of autonomous driving systems
An Guo, Yang Feng, Zhenyu Chen
With the tremendous advancement of Deep Neural Networks (DNNs), autonomous driving systems (ADS) have developed rapidly and have been applied to assist in many safety-critical tasks. However, despite this progress, several real-world accidents involving autonomous cars have even resulted in fatalities. The high complexity and low interpretability of the DNN models that power the perception capability of ADS make conventional testing techniques inapplicable to ADS perception, while existing testing techniques that depend on manual data collection and labeling are time-consuming and prohibitively expensive. In this paper, we design and implement LiRTest, the first automated LiDAR-based testing tool for autonomous vehicles. LiRTest implements ADS-specific metamorphic relations and equips affine and weather transformation operators that reflect the impact of various environmental factors to implement the relations. We evaluate LiRTest with multiple 3D object detection models to assess its performance on different tasks. The experimental results show that LiRTest can activate different neurons of the object detection models and effectively detect their erroneous behaviors under various driving conditions. The results also confirm that LiRTest can improve object detection precision by retraining with the generated data.
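The metamorphic idea above can be sketched in a few lines: apply an affine transformation to a point cloud and check that a detector's output is unchanged. The rotation operator and the stand-in "detector" below are illustrative assumptions, not LiRTest's implementation.

```python
import math

def rotate_z(points, angle_rad):
    """Affine transformation operator: rotate (x, y, z) points about the z axis."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [(c * x - s * y, s * x + c * y, z) for (x, y, z) in points]

def count_detections(points):
    # Hypothetical stand-in for a 3D object detector:
    # here it simply counts above-ground points.
    return sum(1 for (_, _, z) in points if z > 0.0)

cloud = [(1.0, 2.0, 0.5), (0.0, -1.0, 1.2), (3.0, 0.0, -0.2)]
follow_up = rotate_z(cloud, math.pi / 4)

# Metamorphic relation: rotating the scene should not change what is detected.
assert count_detections(cloud) == count_detections(follow_up)
```

A detector whose output changes under such a transformation violates the relation, which is exactly the kind of erroneous behavior this style of testing reveals without labeled ground truth.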
{"title":"LiRTest: augmenting LiDAR point clouds for automated testing of autonomous driving systems","authors":"An Guo, Yang Feng, Zhenyu Chen","doi":"10.1145/3533767.3534397","DOIUrl":"https://doi.org/10.1145/3533767.3534397","url":null,"abstract":"With the tremendous advancement of Deep Neural Networks (DNNs), autonomous driving systems (ADS) have achieved significant development and been applied to assist in many safety-critical tasks. However, despite their spectacular progress, several real-world accidents involving autonomous cars even resulted in a fatality. While the high complexity and low interpretability of DNN models, which empowers the perception capability of ADS, make conventional testing techniques inapplicable for the perception of ADS, the existing testing techniques depending on manual data collection and labeling become time-consuming and prohibitively expensive. In this paper, we design and implement LiRTest, the first automated LiDAR-based autonomous vehicles testing tool. LiRTest implements the ADS-specific metamorphic relation and equips affine and weather transformation operators that can reflect the impact of the various environmental factors to implement the relation. We experiment LiRTest with multiple 3D object detection models to evaluate its performance on different tasks. The experiment results show that LiRTest can activate different neurons of the object detection models and effectively detect their erroneous behaviors under various driving conditions. 
Also, the results confirm that LiRTest can improve the object detection precision by retraining with the generated data.","PeriodicalId":412271,"journal":{"name":"Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis","volume":"84 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132332061","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
ATUA: an update-driven app testing tool
C. Ngo, F. Pastore, L. Briand
App testing tools tend to generate thousands of test inputs; they help engineers identify crashing conditions but not functional failures. Indeed, detecting functional failures requires visual inspection of App outputs, which is infeasible for thousands of inputs. Existing App testing tools also ignore that most Apps are frequently updated and that engineers are mainly interested in testing the updated functionalities; automated regression test cases can be used otherwise. We present ATUA, an open-source tool targeting Android Apps. It achieves high coverage of the updated App code with a small number of test inputs, thus alleviating the test oracle problem (fewer outputs to inspect). It implements a model-based approach that synthesizes App models with static analysis, integrates a dynamically refined state abstraction function, and combines complementary testing strategies, including (1) coverage of the model structure, (2) coverage of the App code, (3) random exploration, and (4) coverage of dependencies identified through information retrieval. Our empirical evaluation, conducted on nine popular Android Apps (72 versions), shows that ATUA achieves higher code coverage than state-of-the-art approaches while producing fewer outputs to be manually inspected. A demo video is available at https://youtu.be/RqQ1z_Nkaqo.
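The update-driven idea can be sketched by restricting coverage targets to code changed between two versions. The method names and version representation (method name mapped to a body hash) are hypothetical illustrations, not ATUA's data model.

```python
def updated_targets(old_version, new_version):
    """Return methods to prioritize: those added or modified in the update.
    Versions are hypothetical dicts mapping method name -> body hash."""
    added = set(new_version) - set(old_version)
    modified = {m for m in set(new_version) & set(old_version)
                if new_version[m] != old_version[m]}
    return added | modified

v1 = {"login": "hash_a", "sync": "hash_b", "export": "hash_c"}
v2 = {"login": "hash_a", "sync": "hash_b2", "share": "hash_d"}

# Only the changed surface needs fresh test inputs and manual inspection.
assert updated_targets(v1, v2) == {"sync", "share"}
```

Focusing generation on this set is what keeps the number of outputs to inspect small, which is the oracle-problem relief the abstract describes.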
{"title":"ATUA: an update-driven app testing tool","authors":"C. Ngo, F. Pastore, L. Briand","doi":"10.1145/3533767.3543293","DOIUrl":"https://doi.org/10.1145/3533767.3543293","url":null,"abstract":"App testing tools tend to generate thousand test inputs; they help engineers identify crashing conditions but not functional failures. Indeed, detecting functional failures requires the visual inspection of App outputs, which is infeasible for thousands of inputs. Existing App testing tools ignore that most of the Apps are frequently updated and engineers are mainly interested in testing the updated functionalities; indeed, automated regression test cases can be used otherwise. We present ATUA, an open source tool targeting Android Apps. It achieves high coverage of the updated App code with a small number of test inputs, thus alleviating the test oracle problem (less outputs to inspect). It implements a model-based approach that synthesizes App models with static analysis, integrates a dynamically-refined state abstraction function and combines complementary testing strategies, including (1) coverage of the model structure, (2) coverage of the App code, (3) random exploration, and (4) coverage of dependencies identified through information retrieval. Our empirical evaluation, conducted with nine popular Android Apps (72 versions), has shown that ATUA, compared to state-of-the-art approaches, achieves higher code coverage while producing fewer outputs to be manually inspected. 
A demo video is available at https://youtu.be/RqQ1z_Nkaqo.","PeriodicalId":412271,"journal":{"name":"Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132000878","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
SpecChecker-ISA: a data sharing analyzer for interrupt-driven embedded software
Boxiang Wang, R. Chen, Chao Li, Tingting Yu, Dongdong Gao, Mengfei Yang
Concurrency bugs are common in interrupt-driven programs, which are widely used in safety-critical areas. These bugs are often caused by incorrect data sharing among tasks and interrupts. Therefore, data sharing analysis is crucial for reasoning about the concurrency behaviours of interrupt-driven programs. Due to the variety of data access forms, existing tools suffer from both extensive false positives and false negatives when applied to interrupt-driven programs. This paper presents SpecChecker-ISA, a tool that provides sound and precise data sharing analysis for interrupt-driven embedded software. The tool uses a memory access model parameterized by numerical invariants, computed by an abstract-interpretation-based value analysis, to describe data accesses of various kinds, and then uses numerical meet operations to obtain the final data sharing result. Our experiments on 4 real-world aerospace embedded programs show that SpecChecker-ISA can find all shared data accesses with few false positives, significantly outperforming other existing tools. The demo can be accessed at https://github.com/wangilson/specchecker-isa.
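A minimal sketch of the "numerical meet" step: accesses are summarized as address intervals (numerical invariants), and a task and an interrupt handler share data exactly when the meet (intersection) of their intervals is non-empty. The intervals and the task/ISR labels below are hypothetical.

```python
def meet(a, b):
    """Meet of two closed intervals in the interval abstract domain;
    None represents bottom (no overlap, hence no sharing)."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None

task_write = (0x1000, 0x10FF)   # address range written by a task (assumed)
isr_read = (0x10F0, 0x1100)     # range read by an interrupt handler (assumed)
other = (0x2000, 0x20FF)        # unrelated buffer

assert meet(task_write, isr_read) == (0x10F0, 0x10FF)  # shared bytes found
assert meet(task_write, other) is None                 # no sharing reported
```

Because intervals over-approximate the concrete access sets soundly, an empty meet safely rules sharing out, while a non-empty meet flags a candidate shared access for reporting.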
{"title":"SpecChecker-ISA: a data sharing analyzer for interrupt-driven embedded software","authors":"Boxiang Wang, R. Chen, Chao Li, Tingting Yu, Dongdong Gao, Mengfei Yang","doi":"10.1145/3533767.3543295","DOIUrl":"https://doi.org/10.1145/3533767.3543295","url":null,"abstract":"Concurrency bugs are common in interrupt-driven programs, which are widely used in safety-critical areas. These bugs are often caused by incorrect data sharing among tasks and interrupts. Therefore, data sharing analysis is crucial to reason about the concurrency behaviours of interrupt-driven programs. Due to the variety of data access forms, existing tools suffer from both extensive false positives and false negatives while applying to interrupt-driven programs. This paper presents SpecChecker-ISA, a tool that provides sound and precise data sharing analysis for interrupt-driven embedded software. The tool uses a memory access model parameterized by numerical invariants, which are computed by abstract interpretation based value analysis, to describe data accesses of various kinds, and then uses numerical meet operations to obtain the final result of data sharing. Our experiments on 4 real-world aerospace embedded software show that SpecChecker-ISA can find all shared data accesses with few false positives, significantly outperforming other existing tools. 
The demo can be accessed at https://github.com/wangilson/specchecker-isa.","PeriodicalId":412271,"journal":{"name":"Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122254776","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
One step further: evaluating interpreters using metamorphic testing
Ming Fan, Jiali Wei, Wuxia Jin, Zhou Xu, Wenying Wei, Ting Liu
The black-box nature of Deep Neural Networks (DNNs) makes it difficult for people to understand why a specific decision is made, which restricts their application in critical tasks. Recently, many interpreters (interpretation methods) have been proposed to improve the transparency of DNNs by providing relevant features in the form of a saliency map. However, different interpreters might provide different interpretation results for the same classification case, which motivates us to evaluate the robustness of interpreters. The biggest challenge in evaluating interpreters is the test oracle problem, i.e., it is hard to label ground-truth interpretation results. To fill this critical gap, we first use images with bounding boxes from an object detection system and images inserted with backdoor triggers as our original ground-truth dataset. Then, we apply metamorphic testing to extend the dataset with three operators: inserting an object, deleting an object, and feature-squeezing the image background. Our key intuition is that after these three operations, which do not modify the primary detected objects, the interpretation results of a good interpreter should not change. Finally, we measure the quality of interpretation results quantitatively with the Intersection-over-Minimum (IoMin) score and evaluate interpreters based on statistics of metamorphic relation failures. We evaluate seven popular interpreters on 877,324 metamorphic images in diverse scenes. The results show that our approach can quantitatively evaluate interpreters' robustness; among the seven interpreters, Grad-CAM provides the most reliable interpretation results.
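The IoMin score above can be sketched on region overlaps: intersection size divided by the size of the smaller region. Modeling regions as sets of pixel coordinates is an illustrative simplification of the saliency-map comparison.

```python
def iomin(region_a, region_b):
    """Intersection-over-Minimum between two pixel regions (sets of (x, y))."""
    inter = len(region_a & region_b)
    return inter / min(len(region_a), len(region_b))

ground_truth = {(x, y) for x in range(4) for y in range(4)}    # 16 pixels
saliency = {(x, y) for x in range(2, 6) for y in range(2, 6)}  # 16 pixels

# Overlap is the 2x2 square from (2, 2) to (3, 3): 4 pixels.
assert iomin(ground_truth, saliency) == 4 / 16
```

Dividing by the minimum (rather than the union, as IoU does) keeps the score from penalizing a saliency region simply for being tighter or looser than the ground-truth box.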
{"title":"One step further: evaluating interpreters using metamorphic testing","authors":"Ming Fan, Jiali Wei, Wuxia Jin, Zhou Xu, Wenying Wei, Ting Liu","doi":"10.1145/3533767.3534225","DOIUrl":"https://doi.org/10.1145/3533767.3534225","url":null,"abstract":"The black-box nature of the Deep Neural Network (DNN) makes it difficult for people to understand why it makes a specific decision, which restricts its applications in critical tasks. Recently, many interpreters (interpretation methods) are proposed to improve the transparency of DNNs by providing relevant features in the form of a saliency map. However, different interpreters might provide different interpretation results for the same classification case, which motivates us to conduct the robustness evaluation of interpreters. However, the biggest challenge of evaluating interpreters is the testing oracle problem, i.e., hard to label ground-truth interpretation results. To fill this critical gap, we first use the images with bounding boxes in the object detection system and the images inserted with backdoor triggers as our original ground-truth dataset. Then, we apply metamorphic testing to extend the dataset by three operators, including inserting an object, deleting an object, and feature squeezing the image background. Our key intuition is that after the three operations which do not modify the primary detected objects, the interpretation results should not change for good interpreters. Finally, we measure the qualities of interpretation results quantitatively with the Intersection-over-Minimum (IoMin) score and evaluate interpreters based on the statistics of metamorphic relation's failures. We evaluate seven popular interpreters on 877,324 metamorphic images in diverse scenes. 
The results show that our approach can quantitatively evaluate interpreters' robustness, where Grad-CAM provides the most reliable interpretation results among the seven interpreters.","PeriodicalId":412271,"journal":{"name":"Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114734988","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
An extensive study on pre-trained models for program understanding and generation
Zhengran Zeng, Hanzhuo Tan, Haotian Zhang, Jing Li, Yuqun Zhang, Lingming Zhang
Automatic program understanding and generation techniques could significantly advance the productivity of programmers and have been widely studied by academia and industry. Recently, the advent of the pre-training paradigm has inspired researchers to develop general-purpose pre-trained models that can be applied to a broad range of program understanding and generation tasks. Such pre-trained models, derived via self-supervised objectives on large unlabelled corpora, can be fine-tuned for downstream tasks (such as code search and code generation) with minimal adaptation. Although these pre-trained models claim superiority over prior techniques, they seldom follow equivalent evaluation protocols, e.g., they are rarely evaluated on identical benchmarks, tasks, or settings. Consequently, there is a pressing need for a comprehensive study of the pre-trained models' effectiveness, versatility, and limitations to provide implications and guidance for future development in this area. To this end, we first perform an extensive study of eight open-access pre-trained models over a large benchmark of seven representative code tasks to assess their reproducibility. We further compare the pre-trained models against domain-specific state-of-the-art techniques to validate pre-trained effectiveness. Finally, we investigate the robustness of the pre-trained models by inspecting their performance variations under adversarial attacks. Through the study, we find that while we can in general replicate the original performance of the pre-trained models on their evaluated tasks and adopted benchmarks, subtle performance fluctuations can refute the findings of their original papers. Moreover, none of the existing pre-trained models dominates all other models. We also find that the pre-trained models can significantly outperform non-pre-trained state-of-the-art techniques in program understanding tasks. Furthermore, we perform the first study of natural language-programming language pre-trained model robustness via adversarial attacks and find that a simple random attack approach can easily fool the state-of-the-art pre-trained models and thus incur security issues. Finally, we provide multiple practical guidelines for advancing future research on pre-trained models for program understanding and generation.
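A minimal sketch of a semantics-preserving random attack of the kind described: rename an identifier at random and observe a model's prediction flip. The "victim" here is a trivial keyword heuristic standing in for a pre-trained model, and the renaming helper is a hypothetical illustration, not the paper's attack implementation.

```python
import random
import re

def victim_predicts_sort(code):
    # Hypothetical stand-in classifier: labels code as "sorting code"
    # whenever the identifier 'sort' appears.
    return re.search(r"\bsort\b", code) is not None

def random_rename(code, name, rng):
    """Semantics-preserving mutation: replace one identifier with a random name."""
    new_name = "v" + str(rng.randrange(10_000))
    return re.sub(rf"\b{re.escape(name)}\b", new_name, code), new_name

rng = random.Random(0)
snippet = "def sort(xs):\n    return sorted(xs)"
attacked, _ = random_rename(snippet, "sort", rng)

# Program behavior is unchanged, but the surface-feature prediction flips.
assert victim_predicts_sort(snippet)
assert not victim_predicts_sort(attacked)
```

The toy victim makes the brittleness obvious; the paper's point is that real pre-trained code models, despite far richer representations, can be fooled by equally cheap perturbations.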
{"title":"An extensive study on pre-trained models for program understanding and generation","authors":"Zhengran Zeng, Hanzhuo Tan, Haotian Zhang, Jing Li, Yuqun Zhang, Lingming Zhang","doi":"10.1145/3533767.3534390","DOIUrl":"https://doi.org/10.1145/3533767.3534390","url":null,"abstract":"Automatic program understanding and generation techniques could significantly advance the productivity of programmers and have been widely studied by academia and industry. Recently, the advent of pre-trained paradigm enlightens researchers to develop general-purpose pre-trained models which can be applied for a broad range of program understanding and generation tasks. Such pre-trained models, derived by self-supervised objectives on large unlabelled corpora, can be fine-tuned in downstream tasks (such as code search and code generation) with minimal adaptations. Although these pre-trained models claim superiority over the prior techniques, they seldom follow equivalent evaluation protocols, e.g., they are hardly evaluated on the identical benchmarks, tasks, or settings. Consequently, there is a pressing need for a comprehensive study of the pre-trained models on their effectiveness, versatility as well as the limitations to provide implications and guidance for the future development in this area. To this end, we first perform an extensive study of eight open-access pre-trained models over a large benchmark on seven representative code tasks to assess their reproducibility. We further compare the pre-trained models and domain-specific state-of-the-art techniques for validating pre-trained effectiveness. At last, we investigate the robustness of the pre-trained models by inspecting their performance variations under adversarial attacks. 
Through the study, we find that while we can in general replicate the original performance of the pre-trained models on their evaluated tasks and adopted benchmarks, subtle performance fluctuations can refute the findings in their original papers. Moreover, none of the existing pre-trained models can dominate over all other models. We also find that the pre-trained models can significantly outperform non-pre-trained state-of-the-art techniques in program understanding tasks. Furthermore, we perform the first study for natural language-programming language pre-trained model robustness via adversarial attacks and find that a simple random attack approach can easily fool the state-of-the-art pre-trained models and thus incur security issues. At last, we also provide multiple practical guidelines for advancing future research on pre-trained models for program understanding and generation.","PeriodicalId":412271,"journal":{"name":"Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128044612","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 52
TensileFuzz: facilitating seed input generation in fuzzing via string constraint solving
Xuwei Liu, Wei You, Zhuo Zhang, X. Zhang
Seed inputs are critical to the performance of mutation-based fuzzers. Existing techniques make use of symbolic execution and gradient descent to generate seed inputs. However, these techniques are not particularly suitable for input growth (i.e., making inputs longer and longer), a key step in seed input generation. Symbolic execution models very low-level constraints and prefers fixed-size inputs, whereas gradient descent only handles cases where path conditions are arithmetic functions of inputs. We observe that growing an input requires considering a number of relations: length, offset, and count, in which a field is the length of another field, the offset of another field, or the count of some pattern in another field, respectively. String solver theory is particularly suitable for addressing these relations. We hence propose a novel technique called TensileFuzz, in which we identify input fields and denote them as string variables such that a seed input is the concatenation of these string variables. Additional padding string variables are inserted between field variables. The aforementioned relations are reverse-engineered and lead to string constraints; solving them instantiates the padding variables and hence grows the input. Our technique also integrates linear regression and gradient descent to ensure that the grown inputs satisfy path constraints that lead to path exploration. Our comparison with AFL and a number of state-of-the-art fuzzers with similar target applications, including Qsym, Angora, and SLF, shows that TensileFuzz substantially outperforms the others, by 39%-98% in terms of path coverage.
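The length and count relations above can be sketched by direct construction rather than a string solver: when a padding variable grows, the dependent length field must be re-solved for the record to stay well-formed. The length-prefixed record format below is a hypothetical example, not a format from the paper.

```python
import struct

def build_record(payload, padding_len):
    """Assemble a record whose header fields are constrained by the body:
    length field = len(payload + padding), count field = commas in payload."""
    padding = b"\x00" * padding_len              # padding string variable
    body = payload + padding
    length_field = struct.pack("<I", len(body))  # length relation
    count_field = struct.pack("<I", payload.count(b","))  # count relation
    return length_field + count_field + body

# Growing the input: enlarge the padding and re-solve the length constraint.
short = build_record(b"a,b,c", 0)
grown = build_record(b"a,b,c", 11)

assert struct.unpack("<I", short[:4])[0] == 5    # body is 5 bytes
assert struct.unpack("<I", grown[:4])[0] == 16   # grown body is 16 bytes
assert struct.unpack("<I", grown[4:8])[0] == 2   # comma count unchanged
```

In TensileFuzz these relations are reverse-engineered from the target and handed to a string solver; the construction here just shows why a naive mutation that lengthens the body without updating the length field would produce a malformed seed.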
{"title":"TensileFuzz: facilitating seed input generation in fuzzing via string constraint solving","authors":"Xuwei Liu, Wei You, Zhuo Zhang, X. Zhang","doi":"10.1145/3533767.3534403","DOIUrl":"https://doi.org/10.1145/3533767.3534403","url":null,"abstract":"Seed inputs are critical to the performance of mutation based fuzzers. Existing techniques make use of symbolic execution and gradient descent to generate seed inputs. However, these techniques are not particular suitable for input growth (i.e., making input longer and longer), a key step in seed input generation. Symbolic execution models very low level constraints and prefer fix-sized inputs whereas gradient descent only handles cases where path conditions are arithmetic functions of inputs. We observe that growing an input requires considering a number of relations: length, offset, and count, in which a field is the length of another field, the offset of another field, and the count of some pattern in another field, respective. String solver theory is particularly suitable for addressing these relations. We hence propose a novel technique called TensileFuzz, in which we identify input fields and denote them as string variables such that a seed input is the concatenation of these string variables. Additional padding string variables are inserted in between field variables. The aforementioned relations are reverse-engineered and lead to string constraints, solving which instantiates the padding variables and hence grows the input. Our technique also integrates linear regression and gradient descent to ensure the grown inputs satisfy path constraints that lead to path exploration. 
Our comparison with AFL, and a number of state-of-the-art fuzzers that have similar target applications, including Qsym, Angora, and SLF, shows that TensileFuzz substantially outperforms the others, by 39% - 98% in terms of path coverage.","PeriodicalId":412271,"journal":{"name":"Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis","volume":"571 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116255237","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 2
NCScope: hardware-assisted analyzer for native code in Android apps
Hao Zhou, Shuohan Wu, Xiapu Luo, Ting Wang, Yajin Zhou, Chao Zhang, Haipeng Cai
More and more Android apps implement their functionalities in native code, and so does malware. Although various approaches have been designed to analyze the native code used by apps, they usually generate incomplete and biased results due to their limitations in obtaining and analyzing high-fidelity execution traces and memory data with low overhead. To fill the gap, in this paper, we propose and develop NCScope, a novel hardware-assisted analyzer for native code in apps. We leverage ETM, a hardware feature of the ARM platform, and eBPF, a kernel component of the Android system, to collect real execution traces and relevant memory data of target apps, and we design new methods to scrutinize native code according to the collected data. To show the unique capability of NCScope, we apply it to four applications that existing tools cannot accomplish, including systematic studies on the self-protection and anti-analysis mechanisms implemented in the native code of apps, analysis of memory corruption in native code, and identification of performance differences between functions in native code. The results uncover that only 26.8% of the analyzed financial apps implement self-protection methods in native code, implying that the security of financial apps is far from expected. Meanwhile, 78.3% of the malicious apps under analysis exhibit anti-analysis behaviors, suggesting that NCScope is very useful for malware analysis. Moreover, NCScope can effectively detect bugs in native code and identify performance differences.
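As a rough illustration of what analyzing the collected traces involves — not NCScope's actual pipeline — the sketch below scans a recorded list of native-call events for common anti-analysis indicators. The event format and the indicator list are assumptions made for the example.

```python
# Rough sketch: scan a recorded trace of native events (the kind of data an
# ETM/eBPF collector yields) for anti-analysis behaviors. The event format
# and the indicator list are illustrative assumptions, not NCScope's design.
ANTI_ANALYSIS_INDICATORS = {
    "ptrace": "debugger detection, e.g. ptrace(PTRACE_TRACEME)",
    "/proc/self/status": "TracerPid inspection",
    "frida": "instrumentation-framework artifact scan",
}

def flag_anti_analysis(trace):
    hits = []
    for event in trace:
        for needle, reason in ANTI_ANALYSIS_INDICATORS.items():
            if needle in event:
                hits.append((event, reason))
    return hits

hits = flag_anti_analysis(["mmap", "ptrace:TRACEME", "open:/proc/self/maps"])
```

A hardware trace makes this kind of scan trustworthy precisely because, unlike instrumentation-based logging, it is hard for the app's anti-analysis code to detect or evade.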
{"title":"NCScope: hardware-assisted analyzer for native code in Android apps","authors":"Hao Zhou, Shuohan Wu, Xiapu Luo, Ting Wang, Yajin Zhou, Chao Zhang, Haipeng Cai","doi":"10.1145/3533767.3534410","DOIUrl":"https://doi.org/10.1145/3533767.3534410","url":null,"abstract":"More and more Android apps implement their functionalities in native code, so does malware. Although various approaches have been designed to analyze the native code used by apps, they usually generate incomplete and biased results due to their limitations in obtaining and analyzing high-fidelity execution traces and memory data with low overheads. To fill the gap, in this paper, we propose and develop a novel hardware-assisted analyzer for native code in apps. We leverage ETM, a hardware feature of ARM platform, and eBPF, a kernel component of Android system, to collect real execution traces and relevant memory data of target apps, and design new methods to scrutinize native code according to the collected data. To show the unique capability of NCScope, we apply it to four applications that cannot be accomplished by existing tools, including systematic studies on self-protection and anti-analysis mechanisms implemented in native code of apps, analysis of memory corruption in native code, and identification of performance differences between functions in native code. The results uncover that only 26.8% of the analyzed financial apps implement self-protection methods in native code, implying that the security of financial apps is far from expected. Meanwhile, 78.3% of the malicious apps under analysis have anti-analysis behaviors, suggesting that NCScope is very useful to malware analysis. 
Moreover, NCScope can effectively detect bugs in native code and identify performance differences.","PeriodicalId":412271,"journal":{"name":"Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115565755","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 6
Maestro: a platform for benchmarking automatic program repair tools on software vulnerabilities
Eduard Pinconschi, Quang-Cuong Bui, Rui Abreu, P. Adão, R. Scandariato
Automating the repair of vulnerabilities is emerging in the field of software security. Previous efforts have leveraged Automated Program Repair (APR) for the task. Reproducible pipelines of repair tools on vulnerability benchmarks can promote advances in the field, such as new repair techniques. We propose Maestro, a decentralized platform with RESTful APIs for performing automated software vulnerability repair. Our platform connects benchmarks of vulnerabilities with APR tools for performing controlled experiments. It also promotes fair comparisons among different APR tools. We compare the performance of Maestro with previous studies on four APR tools in finding repairs for ten projects. Our execution time results indicate an overhead of 23 seconds for projects in C and a reduction of 14 seconds for Java projects. We introduce an agnostic platform for vulnerability repair with preliminary tools/datasets for both C and Java. Maestro is modular and can accommodate tools, benchmarks, and repair workflows with dedicated plugins.
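Since Maestro is driven through RESTful APIs, running a controlled experiment amounts to issuing HTTP requests against the platform. The sketch below shows only the general shape of such a client call; the endpoint path and payload fields are entirely hypothetical, not Maestro's real API.

```python
import json
from urllib.request import Request

# Hypothetical sketch of driving a repair platform over REST. The endpoint
# path and payload fields are invented for illustration; they are not
# Maestro's actual API.
def make_repair_request(base_url: str, tool: str, benchmark: str, vuln_id: str) -> Request:
    body = json.dumps({"tool": tool, "benchmark": benchmark, "vuln": vuln_id}).encode()
    return Request(f"{base_url}/repair", data=body,
                   headers={"Content-Type": "application/json"}, method="POST")

req = make_repair_request("http://localhost:8080", "some-apr-tool", "some-benchmark", "VULN-1")
```

Keeping the tool/benchmark pairing in the request body like this is what lets a decentralized platform dispatch the same vulnerability to different APR tools for a fair comparison.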
{"title":"Maestro: a platform for benchmarking automatic program repair tools on software vulnerabilities","authors":"Eduard Pinconschi, Quang-Cuong Bui, Rui Abreu, P. Adão, R. Scandariato","doi":"10.1145/3533767.3543291","DOIUrl":"https://doi.org/10.1145/3533767.3543291","url":null,"abstract":"Automating the repair of vulnerabilities is emerging in the field of software security. Previous efforts have leveraged Automated Program Repair (APR) for the task. Reproducible pipelines of repair tools on vulnerability benchmarks can promote advances in the field, such as new repair techniques. We propose Maestro, a decentralized platform with RESTful APIs for performing automated software vulnerability repair. Our platform connects benchmarks of vulnerabilities with APR tools for performing controlled experiments. It also promotes fair comparisons among different APR tools. We compare the performance of Maestro with previous studies on four APR tools in finding repairs for ten projects. Our execution time results indicate an overhead of 23 seconds for projects in C and a reduction of 14 seconds for Java projects. We introduce an agnostic platform for vulnerability repair with preliminary tools/datasets for both C and Java. Maestro is modular and can accommodate tools, benchmarks, and repair workflows with dedicated plugins.","PeriodicalId":412271,"journal":{"name":"Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133074872","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 4
Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis