Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis: latest papers, page 5

Challenges and opportunities: an in-depth empirical study on configuration error injection testing

Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis

Pub Date : 2021-07-11 DOI: 10.1145/3460319.3464799

Wang Li, Zhouyang Jia, Shanshan Li, Yuanliang Zhang, Teng Wang, Erci Xu, Ji Wang, Xiangke Liao

Configuration error injection testing (CEIT) can systematically evaluate the reliability and diagnosability of software in the face of runtime configuration errors. This paper explores the challenges and opportunities of applying CEIT techniques. We build an extensible, highly modularized CEIT framework named CeitInspector to experiment with various CEIT techniques. Using CeitInspector, we quantitatively measure the effectiveness and efficiency of CEIT on six mature and widely used server applications. During this process, we find that a fair number of test cases are left unstudied by prior research. The injected configuration errors in these cases often indicate latent misconfigurations, which might be ticking time bombs in the system and lead to severe damage. We conduct an in-depth study of these cases to reveal the root causes and explore possible remedies. Finally, guided by our study, we offer actionable suggestions to improve the effectiveness and efficiency of existing CEIT techniques.
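The idea of injecting a configuration error and observing how the software reacts can be sketched as follows. This is a minimal illustration only, not CeitInspector: the toy `start_server` function, the configuration keys, and the reaction labels are all hypothetical names chosen for the example. The case that is accepted without complaint corresponds to the kind of latent misconfiguration the abstract warns about.

```python
def start_server(config):
    """Toy start-up routine (hypothetical): validates 'port' but never checks 'log_dir'."""
    port = int(config["port"])  # raises ValueError on non-numeric input
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return "started"


def inject_and_observe(base_config, key, bad_value):
    """Replace one configuration value with an erroneous one and record the reaction."""
    config = dict(base_config)
    config[key] = bad_value
    try:
        start_server(config)
    except ValueError as err:
        return f"rejected: {err}"
    # No complaint at start-up: the error stays latent until the value is used.
    return "silently accepted"


base = {"port": "8080", "log_dir": "/var/log/app"}
reactions = {
    "port": inject_and_observe(base, "port", "80800"),
    "log_dir": inject_and_observe(base, "log_dir", ""),
}
```

In this toy setting the out-of-range port is rejected with a diagnostic message, while the empty `log_dir` passes start-up unnoticed.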


Citations: 6

Toward optimal MC/DC test case generation

Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis

Pub Date : 2021-07-11 DOI: 10.1145/3460319.3464841

Sangharatna Godboley, J. Jaffar, Rasool Maghareh, Arpita Dutta

MC/DC coverage prescribes a set of MC/DC sequences. Such a sequence is defined by a specification of the truth values of certain atomic boolean expressions which appear in predicates (i.e., boolean combinations of atomic boolean expressions) in the program. An execution trace satisfies the sequence if it realizes the atomic boolean conditions in accordance with the truth value specification of the sequence. An MC/DC sequence is feasible if there is one such execution trace. The overall goal for an MC/DC test generator is, for each sequence: if feasible, to generate a test input realizing the sequence; otherwise, to prove that the sequence is infeasible. In this paper, we propose a method whose aim is optimal MC/DC coverage for bounded programs, i.e., for each MC/DC sequence, the method either produces a test input or proves that the sequence is infeasible. The method is based on symbolic execution with interpolation, and we present a customized interpolation algorithm. We then present a comprehensive experimental evaluation comparing our method with CBMC, the only available system which can operate on reasonably large programs and which can provide optimal coverage for many examples. We use a benchmark based on RERS, which contains the kinds of reactive programs that motivated MC/DC. We show that our method surpasses CBMC by a significant margin. In particular, our method often produces an optimal MC/DC result.
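To make the truth-value specifications behind MC/DC concrete, here is a minimal brute-force sketch. It is not the paper's symbolic-execution method, and it uses the textbook unique-cause notion of an independence pair rather than the paper's exact definition of an MC/DC sequence. The predicate `a and (b or c)` and the function names are assumptions made for illustration.

```python
from itertools import product


def independence_pairs(predicate, n_conditions):
    """For each condition, list pairs of assignments that differ only in that
    condition and flip the predicate's outcome (unique-cause MC/DC)."""
    pairs = {i: [] for i in range(n_conditions)}
    for values in product([False, True], repeat=n_conditions):
        for i in range(n_conditions):
            if values[i]:
                continue  # visit each pair once, starting from condition i = False
            flipped = values[:i] + (True,) + values[i + 1:]
            if predicate(*values) != predicate(*flipped):
                pairs[i].append((values, flipped))
    return pairs


def decision(a, b, c):
    # Hypothetical predicate standing in for a branch condition in a program.
    return a and (b or c)


pairs = independence_pairs(decision, 3)
```

Each pair is a truth-value specification that a test generator would then have to realize with concrete program inputs, or show to be unreachable. Here condition `b` has exactly one pair, which fixes `a` to true and `c` to false.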



Citations: 6

TauMed: test augmentation of deep learning in medical diagnosis

Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis

Pub Date : 2021-07-11 DOI: 10.1145/3460319.3469080

Yunhan Hou, Jiawei Liu, Daiwei Wang, Jiawei He, Chunrong Fang, Zhenyu Chen

Deep learning has made great progress in medical diagnosis. However, due to data standardization and privacy restrictions, the acquisition and sharing of medical image data have been hindered, leading to unacceptable accuracy in some intelligent medical diagnosis models. Another concern is data quality. If data of insufficient quantity and low quality are used for training and testing medical diagnosis models, serious medical accidents may result. Data augmentation is commonly used to address this, and one of the most representative approaches is through mutation relations. However, although common mutation methods can increase the amount of medical data, the quality of the images cannot be guaranteed due to the particularity of medical images. Therefore, taking the characteristics of medical images into account, we propose TauMed, which implements augmentation techniques based on a series of mutation rules and domain semantics on medical datasets to generate sufficient, high-quality images. Moreover, we chose the ResNet-50 model to experiment with the augmented dataset and compared the results with two popular mutation tools. The experimental results indicate that TauMed can effectively improve the classification accuracy of the model, and that the quality of its augmented images is higher than that of the other two tools. A demonstration video is at https://www.youtube.com/watch?v=O8W8I7U_eqk and TauMed can be used at http://121.196.124.158:9500/.



Citations: 4

Understanding and finding system setting-related defects in Android apps

Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis

Pub Date : 2021-07-11 DOI: 10.1145/3460319.3464806

Jingling Sun, Ting Su, Junxin Li, Zhen Dong, G. Pu, Tao Xie, Z. Su

Android, the most popular mobile system, offers a number of user-configurable system settings (e.g., network, location, and permission) for controlling devices and apps. Even popular, well-tested apps may fail to properly adapt their behaviors to diverse setting changes, thus frustrating their users. However, no prior effort has systematically investigated such defects. To this end, we conduct the first empirical study to understand the characteristics of these setting-related defects ("setting defects" for short), which reside in apps and are triggered by system setting changes. We devote substantial manual effort (over three person-months) to analyzing 1,074 setting defects from 180 popular apps on GitHub. We investigate their impact, root causes, and consequences. We find that setting defects have a wide, diverse impact on apps' correctness, and that the majority of these defects (≈70.7%) cause non-crash (logic) failures and thus cannot be automatically detected by existing app testing techniques due to the lack of strong test oracles. Motivated and guided by our study, we propose setting-wise metamorphic fuzzing, the first automated testing approach to effectively detect setting defects without explicit oracles. Our key insight is that an app's behavior should, in most cases, remain consistent if a given setting is changed and later properly restored, or exhibit expected differences if not restored. We realize our approach in SetDroid, an automated, end-to-end GUI testing tool for detecting both crash and non-crash setting defects. SetDroid has been evaluated on 26 popular, open-source apps and detected 42 unique, previously unknown setting defects in 24 apps. To date, 33 have been confirmed and 21 fixed. We also apply SetDroid to five highly popular industrial apps, namely WeChat, QQMail, TikTok, CapCut, and AlipayHK, each of which has billions of monthly active users. SetDroid successfully detects 17 previously unknown setting defects in these apps' latest releases, and all defects have been confirmed and fixed by the app vendors. The majority of SetDroid-detected defects (49 out of 59) cause non-crash failures, which could not be detected by existing testing tools (as our evaluation confirms). These results demonstrate SetDroid's strong effectiveness and practicality.
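The change-then-restore relation described above can be sketched as a metamorphic check. This is a minimal illustration under stated assumptions, not SetDroid's implementation: `ToyApp`, its deliberately planted defect, and `violates_restore_relation` are hypothetical names, and a real tool would compare GUI states of an actual app rather than strings.

```python
class ToyApp:
    """Hypothetical app model with a deliberately planted setting defect."""

    def __init__(self):
        self.network_on = True
        self.cached_offline = False

    def set_network(self, enabled):
        self.network_on = enabled
        if not enabled:
            self.cached_offline = True  # defect: flag is never cleared on restore

    def screen(self):
        # Observable state used as the comparison point.
        if self.cached_offline or not self.network_on:
            return "offline banner"
        return "feed"


def violates_restore_relation(make_app, change, restore, observe):
    """Return True if changing and then restoring a setting alters observable behavior."""
    baseline = observe(make_app())
    app = make_app()
    change(app)
    restore(app)
    return observe(app) != baseline


found_defect = violates_restore_relation(
    ToyApp,
    change=lambda app: app.set_network(False),
    restore=lambda app: app.set_network(True),
    observe=lambda app: app.screen(),
)
```

No explicit oracle for the correct screen is needed: the untouched run serves as the reference, which is what lets this style of check catch non-crash failures.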



Citations: 23

iDEV: exploring and exploiting semantic deviations in ARM instruction processing

Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis

Pub Date : 2021-07-11 DOI: 10.1145/3460319.3464842

Shisong Qin, Chao Zhang, Kaixiang Chen, Zheming Li

ARM has become the most competitive processor architecture. Many platforms and tools have been developed to execute or analyze ARM instructions, including various commercial CPUs, emulators, and binary analysis tools. However, they deviate from one another when processing the same ARM instructions, and little attention has been paid to systematically analyzing such semantic deviations, not to mention their security implications. In this paper, we conduct an empirical study on the ARM Instruction Semantic Deviation (ISDev) issue. First, we classify this issue into several categories and analyze the security implications behind them. Then, we demonstrate several novel attacks which utilize the ISDev issue, including stealthy targeted attacks and targeted defense evasion. Such attacks could exploit the semantic deviations to generate malware that is specific to certain platforms or able to detect and bypass certain detection solutions. We have developed a framework, iDEV, to systematically explore the ISDev issue in existing ARM instruction processing tools and platforms via differential testing. We have evaluated iDEV on four hardware devices, the QEMU emulator, and five disassemblers which can process the ARMv7-A instruction set. The evaluation results show that over six million instructions could cause dynamic executors (i.e., CPUs and QEMU) to present different runtime behaviors, over eight million instructions could cause static disassemblers to yield different decoding results, and over one million instructions cause inconsistency between dynamic executors and static disassemblers. After analyzing the root causes of each type of deviation, we point out that they are mostly due to ARM unpredictable instructions and program defects.
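The differential-testing idea can be sketched in a few lines: feed the same instruction words to several processors of instructions and flag any word on which they disagree. The sketch below is illustrative only and is not iDEV. The two "decoders" are toy functions with invented acceptance rules that do not reflect real ARM semantics, and all names are hypothetical.

```python
def decoder_a(word):
    # Toy decoder: accepts a word only if its top nibble is 0xE.
    return "valid" if (word >> 28) == 0xE else "undefined"


def decoder_b(word):
    # Toy decoder with a slightly different rule: also accepts top nibble 0xF.
    return "valid" if (word >> 28) in (0xE, 0xF) else "undefined"


def find_deviations(decoders, words):
    """Return the words on which the decoders disagree, with each decoder's result."""
    deviations = []
    for word in words:
        results = {name: decode(word) for name, decode in decoders.items()}
        if len(set(results.values())) > 1:
            deviations.append((word, results))
    return deviations


deviations = find_deviations(
    {"a": decoder_a, "b": decoder_b},
    [0xE0000000, 0xF0000000, 0x00000000],
)
```

Only `0xF0000000` is reported, because it is the single input on which the two toy decoders return different results. No ground truth is required to find the disagreement; deciding which side is wrong is a separate root-cause step.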



Citations: 5

DeepCrime: mutation testing of deep learning systems based on real faults

Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis

Pub Date : 2021-07-11 DOI: 10.1145/3460319.3464825

Nargiz Humbatova, Gunel Jahangirova, P. Tonella

Deep Learning (DL) solutions are increasingly adopted, but how to test them remains a major open research problem. Existing and new testing techniques have been proposed for and adapted to DL systems, including mutation testing. However, no approach has investigated the possibility of simulating the effects of real DL faults by means of mutation operators. We have defined 35 DL mutation operators relying on 3 empirical studies about real faults in DL systems. We followed a systematic process to extract the mutation operators from the existing fault taxonomies, with a formal phase of conflict resolution in case of disagreement. We have implemented 24 of these DL mutation operators in DeepCrime, the first source-level pre-training mutation tool based on real DL faults. We have assessed our mutation operators to understand their characteristics: whether they produce interesting, i.e., killable but not trivial, mutations. Then, we have compared the sensitivity of our tool to changes in the quality of test data with that of DeepMutation++, an existing post-training DL mutation tool.


Citations: 56

Efficient white-box fairness testing through gradient search

Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis

Pub Date : 2021-07-11 DOI: 10.1145/3460319.3464820

Lingfeng Zhang, Yueling Zhang, M. Zhang

Deep learning (DL) systems are increasingly deployed for autonomous decision-making in a wide range of applications. Apart from robustness and safety, fairness is also an important property that a well-designed DL system should have. To evaluate and improve the individual fairness of a model, systematic test case generation for identifying individual discriminatory instances in the input space is essential. In this paper, we propose a framework, EIDIG, for efficiently discovering individual fairness violations. Our technique combines a global generation phase for rapidly generating a set of diverse discriminatory seeds with a local generation phase for generating as many individual discriminatory instances as possible around these seeds under the guidance of the gradient of the model output. In each phase, prior information at successive iterations is fully exploited to accelerate convergence of iterative optimization or reduce the frequency of gradient calculation. Our experimental results show that, on average, EIDIG generates 19.11% more individual discriminatory instances with a speedup of 121.49% when compared with the state-of-the-art method, and mitigates individual discrimination by 80.03% with a limited accuracy loss after retraining.
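The notion of an individual discriminatory instance can be illustrated with a small check: an input counts as discriminatory if changing only its protected attribute changes the model's prediction. The sketch below covers only this check, not EIDIG's gradient-guided global and local search. The toy model, the feature layout, and the function names are assumptions made for the example.

```python
def is_discriminatory(model, x, protected_idx, protected_values):
    """True if altering only the protected attribute of x changes the prediction."""
    original = model(x)
    for value in protected_values:
        if value == x[protected_idx]:
            continue
        variant = list(x)
        variant[protected_idx] = value
        if model(variant) != original:
            return True
    return False


def toy_model(x):
    # Hypothetical biased classifier over [age_bucket, gender, income_level];
    # the protected attribute (index 1) leaks into the decision.
    return int(x[2] + 2 * x[1] >= 5)


near_boundary = is_discriminatory(toy_model, [3, 0, 4], 1, [0, 1])
far_from_boundary = is_discriminatory(toy_model, [3, 0, 6], 1, [0, 1])
```

Only the input close to the decision boundary is flagged, which is why a search procedure that steers inputs toward such regions can find many more instances than random sampling.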



Citations: 21

DialTest: automated testing for recurrent-neural-network-driven dialogue systems

Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis

Pub Date : 2021-07-11 DOI: 10.1145/3460319.3464829

Zixi Liu, Yang Feng, Zhenyu Chen

With the tremendous advancement of recurrent neural networks (RNNs), dialogue systems have achieved significant development. Many RNN-driven dialogue systems, such as Siri, Google Home, and Alexa, have been deployed to assist with various tasks. However, despite this outstanding performance, RNN-driven dialogue systems, which are essentially a kind of software, can also produce erroneous behaviors and cause massive losses. Meanwhile, the complexity and intractability of the RNN models that power dialogue systems make testing them challenging. In this paper, we design and implement DialTest, the first RNN-driven dialogue system testing tool. DialTest employs a series of transformation operators to make realistic changes to seed data while properly preserving their oracle information. To improve the efficiency of fault detection, DialTest further adopts Gini impurity to guide the test generation process. We conduct extensive experiments to validate DialTest. We first evaluate it on two fundamental tasks of natural language understanding, i.e., intent detection and slot filling. The experimental results show that DialTest can effectively detect hundreds of erroneous behaviors in different RNN-driven natural language understanding (NLU) modules of dialogue systems and improve their accuracy via retraining with the generated data. Further, we conduct a case study on an industrial dialogue system to investigate the performance of DialTest in a real usage scenario. The study shows that DialTest can effectively detect errors and improve the robustness of RNN-driven dialogue systems.
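Gini impurity of a predicted probability vector p is 1 minus the sum of the squared probabilities, so it is low when the model is confident and high when the probability mass is spread out. The sketch below shows one way such a score could rank candidate test inputs. How DialTest applies the score exactly is not stated in the abstract, so the ranking step, the candidate names, and the probability values are assumptions for illustration.

```python
def gini_impurity(probs):
    """Gini impurity of a probability vector: 1 - sum(p_i ** 2)."""
    return 1.0 - sum(p * p for p in probs)


# Hypothetical intent-classifier outputs for two generated utterances.
candidates = {
    "confident": [0.9, 0.05, 0.05],
    "uncertain": [0.4, 0.3, 0.3],
}

# Rank candidates so that inputs the model is least sure about come first.
ranked = sorted(candidates, key=lambda name: gini_impurity(candidates[name]), reverse=True)
```

Prioritizing high-impurity inputs is a heuristic: they are the ones on which the model's decision is least certain, and so plausibly the most likely to expose erroneous behavior.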



Citations: 15

C4: the C compiler concurrency checker

Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis

Pub Date : 2021-07-11 DOI: 10.1145/3460319.3469079

Matt Windsor, A. Donaldson, John Wickerson

The correct compilation of atomic-action concurrency is vital now that multicore processors are ubiquitous. Despite much recent work on automated compiler testing, little existing tooling can test how real-world compilers handle compilation of atomic-action code. We demonstrate C4, a tool for exploring the concurrency behaviour of real-world C compilers such as GCC and LLVM. C4 automates a workflow based on generating, fuzzing, and executing litmus tests. So far, C4 has found two new control-flow bugs in GCC and IBM XL, and reproduced two historic concurrency bugs in GCC 4.
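
A litmus test is a tiny concurrent program paired with an outcome of interest. The sketch below is not part of C4 and does not compile any C code. It models the classic message-passing shape in Python and enumerates every interleaving, assuming sequential consistency as the reference model. All names are hypothetical. It shows the kind of question such a test asks: can the reader observe the flag as set while still seeing stale data?

```python
from itertools import combinations

# Thread 0 writes the data (x) and then sets a flag (y).
WRITER = [("store", "x", 1), ("store", "y", 1)]
# Thread 1 reads the flag into r0 and then reads the data into r1.
READER = [("load", "y", "r0"), ("load", "x", "r1")]


def interleavings(a, b):
    """Yield every merge of a and b that keeps each thread's program order."""
    n = len(a) + len(b)
    for slots in combinations(range(n), len(a)):
        slot_set = set(slots)
        it_a, it_b = iter(a), iter(b)
        yield [next(it_a) if i in slot_set else next(it_b) for i in range(n)]


def run(schedule):
    """Execute one interleaving against a single shared memory."""
    memory = {"x": 0, "y": 0}
    registers = {}
    for op, location, arg in schedule:
        if op == "store":
            memory[location] = arg
        else:
            registers[arg] = memory[location]
    return registers["r0"], registers["r1"]


def observable_outcomes():
    """Set of (r0, r1) pairs reachable under sequential consistency."""
    return {run(schedule) for schedule in interleavings(WRITER, READER)}


if __name__ == "__main__":
    outcomes = observable_outcomes()
    print(sorted(outcomes))
    # (1, 0) means the flag was seen but the data was stale.
    print("stale read possible:", (1, 0) in outcomes)
```

In this model the outcome `(1, 0)` never appears. If a compiled version of a suitably synchronised test did produce it, that would point to a reordering the source program does not permit.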

Citations: 5

HomDroid: detecting Android covert malware by social-network homophily analysis

Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis

Pub Date : 2021-07-10 DOI: 10.1145/3460319.3464833

Yueming Wu, Deqing Zou, Wei Yang, Xiang Li, Hai Jin

Android has become the most popular mobile operating system. Correspondingly, a growing amount of Android malware has been developed and spread to steal users’ private information. One type of malware implements benign behaviors to camouflage its malicious ones. The malicious component occupies only a small part of the entire code of the application (app for short) and is strongly coupled with the benign part. Such malware can cause false negatives when malware detectors extract features from the entire app for classification, because its malicious features may be hidden among benign features. Some previous work divides the entire app into several parts to discover the malicious part. However, these partitioning methods assume that the connections between the normal part and the malicious part are weak (as in repackaged malware). In this paper, we call this type of malware Android covert malware and generate the first dataset of covert malware. To detect covert malware samples, we first conduct static analysis to extract function call graphs. Through in-depth analysis of these call graphs, we observe that although the correlations between the normal part and the malicious part are high, the degree of these correlations has a distinctive distribution range. Based on this observation, we design a novel system, HomDroid, to detect covert malware by analyzing the homophily of call graphs. We identify the ideal correlation threshold for distinguishing the normal part from the malicious part based on evaluation results on a dataset of 4,840 benign apps and 3,385 covert malicious apps. According to our evaluation results, HomDroid detects 96.8% of covert malware, while the false negative rates of four other state-of-the-art systems (PerDroid, Drebin, MaMaDroid, and IntDroid) are 30.7%, 16.3%, 15.2%, and 10.4%, respectively.
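
The abstract reports HomDroid's result as a detection rate and the baselines' results as false negative rates. Since the detection rate on malware equals 100% minus the false negative rate, the figures can be put on one scale. The sketch below does only that conversion, using the numbers quoted in the abstract. The helper names are illustrative, and the code assumes all rates were measured on the same covert-malware samples.

```python
def detection_rate_from_fnr(fnr_percent):
    """Detection rate (recall) in percent, given a false negative rate in percent."""
    if not 0.0 <= fnr_percent <= 100.0:
        raise ValueError("rate must be between 0 and 100")
    return round(100.0 - fnr_percent, 1)


# False negative rates quoted in the abstract for the four baseline systems.
REPORTED_FNR = {"PerDroid": 30.7, "Drebin": 16.3, "MaMaDroid": 15.2, "IntDroid": 10.4}
# Detection rate quoted in the abstract for HomDroid.
HOMDROID_DETECTION_RATE = 96.8


def comparison_table():
    """Map each system to its detection rate, highest first."""
    rows = {name: detection_rate_from_fnr(fnr) for name, fnr in REPORTED_FNR.items()}
    rows["HomDroid"] = HOMDROID_DETECTION_RATE
    return dict(sorted(rows.items(), key=lambda kv: kv[1], reverse=True))


if __name__ == "__main__":
    for system, rate in comparison_table().items():
        print(f"{system:10s} detects {rate:.1f}%  (misses {100.0 - rate:.1f}%)")
```

On this common scale, HomDroid misses 3.2% of covert malware, against 10.4% for the best baseline quoted (IntDroid).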


Citations: 11