Wang Li, Zhouyang Jia, Shanshan Li, Yuanliang Zhang, Teng Wang, Erci Xu, Ji Wang, Xiangke Liao
Configuration error injection testing (CEIT) could systematically evaluate software reliability and diagnosability to runtime configuration errors. This paper explores the challenges and opportunities of applying CEIT technique. We build an extensible, highly-modularized CEIT framework named CeitInspector to experiment with various CEIT techniques. Using CeitInspector, we quantitatively measure the effectiveness and efficiency of CEIT using six mature and widely-used server applications. During this process, we find a fair number of test cases are left unstudied by the prior research work. The injected configuration errors in these cases often indicate latent misconfigurations, which might be ticking time bombs in the system and lead to severe damage. We conduct an in-depth study regarding these cases to reveal the root causes, and explore possible remedies. Finally, we come up with actionable suggestions guided by our study to improve the effectiveness and efficiency of the existing CEIT techniques.
{"title":"Challenges and opportunities: an in-depth empirical study on configuration error injection testing","authors":"Wang Li, Zhouyang Jia, Shanshan Li, Yuanliang Zhang, Teng Wang, Erci Xu, Ji Wang, Xiangke Liao","doi":"10.1145/3460319.3464799","DOIUrl":"https://doi.org/10.1145/3460319.3464799","url":null,"abstract":"Configuration error injection testing (CEIT) could systematically evaluate software reliability and diagnosability to runtime configuration errors. This paper explores the challenges and opportunities of applying CEIT technique. We build an extensible, highly-modularized CEIT framework named CeitInspector to experiment with various CEIT techniques. Using CeitInspector, we quantitatively measure the effectiveness and efficiency of CEIT using six mature and widely-used server applications. During this process, we find a fair number of test cases are left unstudied by the prior research work. The injected configuration errors in these cases often indicate latent misconfigurations, which might be ticking time bombs in the system and lead to severe damage. We conduct an in-depth study regarding these cases to reveal the root causes, and explore possible remedies. Finally, we come up with actionable suggestions guided by our study to improve the effectiveness and efficiency of the existing CEIT techniques.","PeriodicalId":188008,"journal":{"name":"Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132821437","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sangharatna Godboley, J. Jaffar, Rasool Maghareh, Arpita Dutta
MC/DC coverage prescribes a set of MC/DC sequences. Such a sequence is defined by a specification of the truth values of certain atomic boolean expressions which appear in predicates (i.e. boolean combinations of atomic boolean expressions) in the program. An execution trace satisfies the sequence if it realizes the atomic boolean conditions in accordance with the truth value specification of the sequence. An MC/DC sequence is feasible if there is one such execution trace. The overall goal for an MC/DC test generator is, for each sequence: if feasible, to generate a test input realizing the sequence; otherwise, to prove that the sequence is infeasible. In this paper, we propose a method whose aim is optimal MC/DC coverage for bounded programs, i.e. for each MC/DC sequence, the method either produces a test input, or proves that sequence is infeasible. The method is based on symbolic execution with interpolation, and in this paper, we present a customized interpolation algorithm. We then present a comprehensive experimental evaluation comparing with the only available system CBMC which can operate on reasonably large programs, and further, which can provide optimal coverage for many examples. We will use a benchmark based on RERS which contains the kinds of reactive programs for which MC/DC was motivated by. We show that our method, by a significant margin, surpasses CBMC. In particular, our method often produces an optimal MC/DC result.
{"title":"Toward optimal mc/dc test case generation","authors":"Sangharatna Godboley, J. Jaffar, Rasool Maghareh, Arpita Dutta","doi":"10.1145/3460319.3464841","DOIUrl":"https://doi.org/10.1145/3460319.3464841","url":null,"abstract":"MC/DC coverage prescribes a set of MC/DC sequences. Such a sequence is defined by a specification of the truth values of certain atomic boolean expressions which appear in predicates (i.e. boolean combinations of atomic boolean expressions) in the program. An execution trace satisfies the sequence if it realizes the atomic boolean conditions in accordance with the truth value specification of the sequence. An MC/DC sequence is feasible if there is one such execution trace. The overall goal for an MC/DC test generator is, for each sequence: if feasible, to generate a test input realizing the sequence; otherwise, to prove that the sequence is infeasible. In this paper, we propose a method whose aim is optimal MC/DC coverage for bounded programs, i.e. for each MC/DC sequence, the method either produces a test input, or proves that sequence is infeasible. The method is based on symbolic execution with interpolation, and in this paper, we present a customized interpolation algorithm. We then present a comprehensive experimental evaluation comparing with the only available system CBMC which can operate on reasonably large programs, and further, which can provide optimal coverage for many examples. We will use a benchmark based on RERS which contains the kinds of reactive programs for which MC/DC was motivated by. We show that our method, by a significant margin, surpasses CBMC. In particular, our method often produces an optimal MC/DC result.","PeriodicalId":188008,"journal":{"name":"Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis","volume":"120 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131725171","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep learning has made great progress in medical diagnosis. However, due to data standardization and privacy restriction, the acquisition and sharing of medical image data have been hindered, leading to the unacceptable accuracy of some intelligent medical diagnosis models. Another concern is data quality. If insufficient quantity and low-quality data are used for training and testing medical diagnosis models, it may cause serious medical accidents. We always use data augmentation to deal with it, and one of the most representative ways is through mutation relation. However, although common mutation methods can increase the amount of medical data, the quality of the image cannot be guaranteed due to the particularity of medical image. Therefore, combined with the characteristics of medical images, we propose TauMed, which implements augmentation techniques based on a series of mutation rules and domain semantics on medical datasets to generate sufficient and high-quality images. Moreover, we chose the ResNet-50 model to experiment with the augmented dataset and compared the results with two main popular mutation tools. The experimental result indicates that TauMed can improve the classification accuracy of the model effectively, and the quality of augmented images is higher than the other two tools. Its video is at https://www.youtube.com/watch?v=O8W8I7U_eqk and TauMed can be used at http://121.196.124.158:9500/.
{"title":"TauMed: test augmentation of deep learning in medical diagnosis","authors":"Yunhan Hou, Jiawei Liu, Daiwei Wang, Jiawei He, Chunrong Fang, Zhenyu Chen","doi":"10.1145/3460319.3469080","DOIUrl":"https://doi.org/10.1145/3460319.3469080","url":null,"abstract":"Deep learning has made great progress in medical diagnosis. However, due to data standardization and privacy restriction, the acquisition and sharing of medical image data have been hindered, leading to the unacceptable accuracy of some intelligent medical diagnosis models. Another concern is data quality. If insufficient quantity and low-quality data are used for training and testing medical diagnosis models, it may cause serious medical accidents. We always use data augmentation to deal with it, and one of the most representative ways is through mutation relation. However, although common mutation methods can increase the amount of medical data, the quality of the image cannot be guaranteed due to the particularity of medical image. Therefore, combined with the characteristics of medical images, we propose TauMed, which implements augmentation techniques based on a series of mutation rules and domain semantics on medical datasets to generate sufficient and high-quality images. Moreover, we chose the ResNet-50 model to experiment with the augmented dataset and compared the results with two main popular mutation tools. The experimental result indicates that TauMed can improve the classification accuracy of the model effectively, and the quality of augmented images is higher than the other two tools. Its video is at https://www.youtube.com/watch?v=O8W8I7U_eqk and TauMed can be used at http://121.196.124.158:9500/.","PeriodicalId":188008,"journal":{"name":"Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133465176","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jingling Sun, Ting Su, Junxin Li, Zhen Dong, G. Pu, Tao Xie, Z. Su
Android, the most popular mobile system, offers a number of user-configurable system settings (e.g., network, location, and permission) for controlling devices and apps. Even popular, well-tested apps may fail to properly adapt their behaviors to diverse setting changes, thus frustrating their users. However, there exists no effort to systematically investigate such defects. To this end, we conduct the first empirical study to understand the characteristics of these setting-related defects (in short as "setting defects"), which reside in apps and are triggered by system setting changes. We devote substantial manual effort (over three person-months) to analyze 1,074 setting defects from 180 popular apps on GitHub. We investigate their impact, root causes, and consequences. We find that setting defects have a wide, diverse impact on apps' correctness, and the majority of these defects (≈70.7%) cause non-crash (logic) failures, and thus could not be automatically detected by existing app testing techniques due to the lack of strong test oracles. Motivated and guided by our study, we propose setting-wise metamorphic fuzzing, the first automated testing approach to effectively detect setting defects without explicit oracles. Our key insight is that an app's behavior should, in most cases, remain consistent if a given setting is changed and later properly restored, or exhibit expected differences if not restored. We realize our approach in SetDroid, an automated, end-to-end GUI testing tool, for detecting both crash and non-crash setting defects. SetDroid has been evaluated on 26 popular, open-source apps and detected 42 unique, previously unknown setting defects in 24 apps. To date, 33 have been confirmed and 21 fixed. We also apply SetDroid on five highly popular industrial apps, namely WeChat, QQMail, TikTok, CapCut, and AlipayHK, all of which each have billions of monthly active users. SetDroid successfully detects 17 previously unknown setting defects in these apps' latest releases, and all defects have been confirmed and fixed by the app vendors. The majority of SetDroid-detected defects (49 out of 59) cause non-crash failures, which could not be detected by existing testing tools (as our evaluation confirms). These results demonstrate SetDroid's strong effectiveness and practicality.
{"title":"Understanding and finding system setting-related defects in Android apps","authors":"Jingling Sun, Ting Su, Junxin Li, Zhen Dong, G. Pu, Tao Xie, Z. Su","doi":"10.1145/3460319.3464806","DOIUrl":"https://doi.org/10.1145/3460319.3464806","url":null,"abstract":"Android, the most popular mobile system, offers a number of user-configurable system settings (e.g., network, location, and permission) for controlling devices and apps. Even popular, well-tested apps may fail to properly adapt their behaviors to diverse setting changes, thus frustrating their users. However, there exists no effort to systematically investigate such defects. To this end, we conduct the first empirical study to understand the characteristics of these setting-related defects (in short as \"setting defects\"), which reside in apps and are triggered by system setting changes. We devote substantial manual effort (over three person-months) to analyze 1,074 setting defects from 180 popular apps on GitHub. We investigate their impact, root causes, and consequences. We find that setting defects have a wide, diverse impact on apps' correctness, and the majority of these defects (≈70.7%) cause non-crash (logic) failures, and thus could not be automatically detected by existing app testing techniques due to the lack of strong test oracles. Motivated and guided by our study, we propose setting-wise metamorphic fuzzing, the first automated testing approach to effectively detect setting defects without explicit oracles. Our key insight is that an app's behavior should, in most cases, remain consistent if a given setting is changed and later properly restored, or exhibit expected differences if not restored. We realize our approach in SetDroid, an automated, end-to-end GUI testing tool, for detecting both crash and non-crash setting defects. SetDroid has been evaluated on 26 popular, open-source apps and detected 42 unique, previously unknown setting defects in 24 apps. To date, 33 have been confirmed and 21 fixed. We also apply SetDroid on five highly popular industrial apps, namely WeChat, QQMail, TikTok, CapCut, and AlipayHK, all of which each have billions of monthly active users. SetDroid successfully detects 17 previously unknown setting defects in these apps' latest releases, and all defects have been confirmed and fixed by the app vendors. The majority of SetDroid-detected defects (49 out of 59) cause non-crash failures, which could not be detected by existing testing tools (as our evaluation confirms). These results demonstrate SetDroid's strong effectiveness and practicality.","PeriodicalId":188008,"journal":{"name":"Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123098170","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shisong Qin, Chao Zhang, Kaixiang Chen, Zheming Li
ARM has become the most competitive processor architecture. Many platforms or tools are developed to execute or analyze ARM instructions, including various commercial CPUs, emulators, and binary analysis tools. However, they have deviations when processing the same ARM instructions, and little attention has been paid to systematically analyze such semantic deviations, not to mention the security implications of such deviations. In this paper, we conduct an empirical study on the ARM Instruction Semantic Deviation (ISDev) issue. First, we classify this issue into several categories and analyze the security implications behind them. Then, we further demonstrate several novel attacks which utilize the ISDev issue, including stealthy targeted attacks and targeted defense evasion. Such attacks could exploit the semantic deviations to generate malware that is specific to certain platforms or able to detect and bypass certain detection solutions. We have developed a framework iDEV to systematically explore the ISDev issue in existing ARM instructions processing tools and platforms via differential testing. We have evaluated iDEV on four hardware devices, the QEMU emulator, and five disassemblers which could process the ARMv7-A instruction set. The evaluation results show that, over six million instructions could cause dynamic executors (i.e., CPUs and QEMU) to present different runtime behaviors, and over eight million instructions could cause static disassemblers yielding different decoding results, and over one million instructions cause inconsistency between dynamic executors and static disassemblers. After analyzing the root causes of each type of deviation, we point out they are mostly due to ARM unpredictable instructions and program defects.
{"title":"iDEV: exploring and exploiting semantic deviations in ARM instruction processing","authors":"Shisong Qin, Chao Zhang, Kaixiang Chen, Zheming Li","doi":"10.1145/3460319.3464842","DOIUrl":"https://doi.org/10.1145/3460319.3464842","url":null,"abstract":"ARM has become the most competitive processor architecture. Many platforms or tools are developed to execute or analyze ARM instructions, including various commercial CPUs, emulators, and binary analysis tools. However, they have deviations when processing the same ARM instructions, and little attention has been paid to systematically analyze such semantic deviations, not to mention the security implications of such deviations. In this paper, we conduct an empirical study on the ARM Instruction Semantic Deviation (ISDev) issue. First, we classify this issue into several categories and analyze the security implications behind them. Then, we further demonstrate several novel attacks which utilize the ISDev issue, including stealthy targeted attacks and targeted defense evasion. Such attacks could exploit the semantic deviations to generate malware that is specific to certain platforms or able to detect and bypass certain detection solutions. We have developed a framework iDEV to systematically explore the ISDev issue in existing ARM instructions processing tools and platforms via differential testing. We have evaluated iDEV on four hardware devices, the QEMU emulator, and five disassemblers which could process the ARMv7-A instruction set. The evaluation results show that, over six million instructions could cause dynamic executors (i.e., CPUs and QEMU) to present different runtime behaviors, and over eight million instructions could cause static disassemblers yielding different decoding results, and over one million instructions cause inconsistency between dynamic executors and static disassemblers. After analyzing the root causes of each type of deviation, we point out they are mostly due to ARM unpredictable instructions and program defects.","PeriodicalId":188008,"journal":{"name":"Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133655051","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep Learning (DL) solutions are increasingly adopted, but how to test them remains a major open research problem. Existing and new testing techniques have been proposed for and adapted to DL systems, including mutation testing. However, no approach has investigated the possibility to simulate the effects of real DL faults by means of mutation operators. We have defined 35 DL mutation operators relying on 3 empirical studies about real faults in DL systems. We followed a systematic process to extract the mutation operators from the existing fault taxonomies, with a formal phase of conflict resolution in case of disagreement. We have implemented 24 of these DL mutation operators into DeepCrime, the first source-level pre-training mutation tool based on real DL faults. We have assessed our mutation operators to understand their characteristics: whether they produce interesting, i.e., killable but not trivial, mutations. Then, we have compared the sensitivity of our tool to the changes in the quality of test data with that of DeepMutation++, an existing post-training DL mutation tool.
{"title":"DeepCrime: mutation testing of deep learning systems based on real faults","authors":"Nargiz Humbatova, Gunel Jahangirova, P. Tonella","doi":"10.1145/3460319.3464825","DOIUrl":"https://doi.org/10.1145/3460319.3464825","url":null,"abstract":"Deep Learning (DL) solutions are increasingly adopted, but how to test them remains a major open research problem. Existing and new testing techniques have been proposed for and adapted to DL systems, including mutation testing. However, no approach has investigated the possibility to simulate the effects of real DL faults by means of mutation operators. We have defined 35 DL mutation operators relying on 3 empirical studies about real faults in DL systems. We followed a systematic process to extract the mutation operators from the existing fault taxonomies, with a formal phase of conflict resolution in case of disagreement. We have implemented 24 of these DL mutation operators into DeepCrime, the first source-level pre-training mutation tool based on real DL faults. We have assessed our mutation operators to understand their characteristics: whether they produce interesting, i.e., killable but not trivial, mutations. Then, we have compared the sensitivity of our tool to the changes in the quality of test data with that of DeepMutation++, an existing post-training DL mutation tool.","PeriodicalId":188008,"journal":{"name":"Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123585416","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep learning (DL) systems are increasingly deployed for autonomous decision-making in a wide range of applications. Apart from the robustness and safety, fairness is also an important property that a well-designed DL system should have. To evaluate and improve individual fairness of a model, systematic test case generation for identifying individual discriminatory instances in the input space is essential. In this paper, we propose a framework EIDIG for efficiently discovering individual fairness violation. Our technique combines a global generation phase for rapidly generating a set of diverse discriminatory seeds with a local generation phase for generating as many individual discriminatory instances as possible around these seeds under the guidance of the gradient of the model output. In each phase, prior information at successive iterations is fully exploited to accelerate convergence of iterative optimization or reduce frequency of gradient calculation. Our experimental results show that, on average, our approach EIDIG generates 19.11% more individual discriminatory instances with a speedup of 121.49% when compared with the state-of-the-art method and mitigates individual discrimination by 80.03% with a limited accuracy loss after retraining.
{"title":"Efficient white-box fairness testing through gradient search","authors":"Lingfeng Zhang, Yueling Zhang, M. Zhang","doi":"10.1145/3460319.3464820","DOIUrl":"https://doi.org/10.1145/3460319.3464820","url":null,"abstract":"Deep learning (DL) systems are increasingly deployed for autonomous decision-making in a wide range of applications. Apart from the robustness and safety, fairness is also an important property that a well-designed DL system should have. To evaluate and improve individual fairness of a model, systematic test case generation for identifying individual discriminatory instances in the input space is essential. In this paper, we propose a framework EIDIG for efficiently discovering individual fairness violation. Our technique combines a global generation phase for rapidly generating a set of diverse discriminatory seeds with a local generation phase for generating as many individual discriminatory instances as possible around these seeds under the guidance of the gradient of the model output. In each phase, prior information at successive iterations is fully exploited to accelerate convergence of iterative optimization or reduce frequency of gradient calculation. Our experimental results show that, on average, our approach EIDIG generates 19.11% more individual discriminatory instances with a speedup of 121.49% when compared with the state-of-the-art method and mitigates individual discrimination by 80.03% with a limited accuracy loss after retraining.","PeriodicalId":188008,"journal":{"name":"Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129305923","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With the tremendous advancement of recurrent neural networks(RNN), dialogue systems have achieved significant development. Many RNN-driven dialogue systems, such as Siri, Google Home, and Alexa, have been deployed to assist various tasks. However, accompanying this outstanding performance, RNN-driven dialogue systems, which are essentially a kind of software, could also produce erroneous behaviors and result in massive losses. Meanwhile, the complexity and intractability of RNN models that power the dialogue systems make their testing challenging. In this paper, we design and implement DialTest, the first RNN-driven dialogue system testing tool. DialTest employs a series of transformation operators to make realistic changes on seed data while preserving their oracle information properly. To improve the efficiency of detecting faults, DialTest further adopts Gini impurity to guide the test generation process. We conduct extensive experiments to validate DialTest. We first experiment it on two fundamental tasks, i.e., intent detection and slot filling, of natural language understanding. The experiment results show that DialTest can effectively detect hundreds of erroneous behaviors for different RNN-driven natural language understanding (NLU) modules of dialogue systems and improve their accuracy via retraining with the generated data. Further, we conduct a case study on an industrial dialogue system to investigate the performance of DialTest under the real usage scenario. The study shows DialTest can detect errors and improve the robustness of RNN-driven dialogue systems effectively.
{"title":"DialTest: automated testing for recurrent-neural-network-driven dialogue systems","authors":"Zixi Liu, Yang Feng, Zhenyu Chen","doi":"10.1145/3460319.3464829","DOIUrl":"https://doi.org/10.1145/3460319.3464829","url":null,"abstract":"With the tremendous advancement of recurrent neural networks(RNN), dialogue systems have achieved significant development. Many RNN-driven dialogue systems, such as Siri, Google Home, and Alexa, have been deployed to assist various tasks. However, accompanying this outstanding performance, RNN-driven dialogue systems, which are essentially a kind of software, could also produce erroneous behaviors and result in massive losses. Meanwhile, the complexity and intractability of RNN models that power the dialogue systems make their testing challenging. In this paper, we design and implement DialTest, the first RNN-driven dialogue system testing tool. DialTest employs a series of transformation operators to make realistic changes on seed data while preserving their oracle information properly. To improve the efficiency of detecting faults, DialTest further adopts Gini impurity to guide the test generation process. We conduct extensive experiments to validate DialTest. We first experiment it on two fundamental tasks, i.e., intent detection and slot filling, of natural language understanding. The experiment results show that DialTest can effectively detect hundreds of erroneous behaviors for different RNN-driven natural language understanding (NLU) modules of dialogue systems and improve their accuracy via retraining with the generated data. Further, we conduct a case study on an industrial dialogue system to investigate the performance of DialTest under the real usage scenario. The study shows DialTest can detect errors and improve the robustness of RNN-driven dialogue systems effectively.","PeriodicalId":188008,"journal":{"name":"Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121091030","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The correct compilation of atomic-action concurrency is vital now that multicore processors are ubiquitous. Despite much recent work on automated compiler testing, little existing tooling can test how real-world compilers handle compilation of atomic-action code. We demonstrate C4, a tool for exploring the concurrency behaviour of real-world C compilers such as GCC and LLVM. C4 automates a workflow based on generating, fuzzing, and executing litmus tests. So far, C4 has found two new control-flow bugs in GCC and IBM XL, and reproduced two historic concurrency bugs in GCC 4.
{"title":"C4: the C compiler concurrency checker","authors":"Matt Windsor, A. Donaldson, John Wickerson","doi":"10.1145/3460319.3469079","DOIUrl":"https://doi.org/10.1145/3460319.3469079","url":null,"abstract":"The correct compilation of atomic-action concurrency is vital now that multicore processors are ubiquitous. Despite much recent work on automated compiler testing, little existing tooling can test how real-world compilers handle compilation of atomic-action code. We demonstrate C4, a tool for exploring the concurrency behaviour of real-world C compilers such as GCC and LLVM. C4 automates a workflow based on generating, fuzzing, and executing litmus tests. So far, C4 has found two new control-flow bugs in GCC and IBM XL, and reproduced two historic concurrency bugs in GCC 4.","PeriodicalId":188008,"journal":{"name":"Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122269614","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yueming Wu, Deqing Zou, Wei Yang, Xiang Li, Hai Jin
Android has become the most popular mobile operating system. Correspondingly, an increasing number of Android malware has been developed and spread to steal users’ private information. There exists one type of malware whose benign behaviors are developed to camouflage malicious behaviors. The malicious component occupies a small part of the entire code of the application (app for short), and the malicious part is strongly coupled with the benign part. In this case, the malware may cause false negatives when malware detectors extract features from the entire apps to conduct classification because the malicious features of these apps may be hidden among benign features. Moreover, some previous work aims to divide the entire app into several parts to discover the malicious part. However, the premise of these methods to commence app partition is that the connections between the normal part and the malicious part are weak (repackaged malware). In this paper, we call this type of malware as Android covert malware and generate the first dataset of covert malware. To detect covert malware samples, we first conduct static analysis to extract the function call graphs. Through the deep analysis on call graphs, we observe that although the correlations between the normal part and the malicious part in these graphs are high, the degree of these correlations has a unique range of distribution. Based on the observation, we design a novel system, HomDroid, to detect covert malware by analyzing the homophily of call graphs. We identify the ideal threshold of correlation to distinguish the normal part and the malicious part based on the evaluation results on a dataset of 4,840 benign apps and 3,385 covert malicious apps. According to our evaluation results, HomDroid is capable of detecting 96.8% of covert malware while the False Negative Rates of another four state-of-the-art systems (PerDroid, Drebin, MaMaDroid, and IntDroid) are 30.7%, 16.3%, 15.2%, and 10.4%, respectively.
{"title":"HomDroid: detecting Android covert malware by social-network homophily analysis","authors":"Yueming Wu, Deqing Zou, Wei Yang, Xiang Li, Hai Jin","doi":"10.1145/3460319.3464833","DOIUrl":"https://doi.org/10.1145/3460319.3464833","url":null,"abstract":"Android has become the most popular mobile operating system. Correspondingly, an increasing number of Android malware has been developed and spread to steal users’ private information. There exists one type of malware whose benign behaviors are developed to camouflage malicious behaviors. The malicious component occupies a small part of the entire code of the application (app for short), and the malicious part is strongly coupled with the benign part. In this case, the malware may cause false negatives when malware detectors extract features from the entire apps to conduct classification because the malicious features of these apps may be hidden among benign features. Moreover, some previous work aims to divide the entire app into several parts to discover the malicious part. However, the premise of these methods to commence app partition is that the connections between the normal part and the malicious part are weak (repackaged malware). In this paper, we call this type of malware as Android covert malware and generate the first dataset of covert malware. To detect covert malware samples, we first conduct static analysis to extract the function call graphs. Through the deep analysis on call graphs, we observe that although the correlations between the normal part and the malicious part in these graphs are high, the degree of these correlations has a unique range of distribution. Based on the observation, we design a novel system, HomDroid, to detect covert malware by analyzing the homophily of call graphs. We identify the ideal threshold of correlation to distinguish the normal part and the malicious part based on the evaluation results on a dataset of 4,840 benign apps and 3,385 covert malicious apps. According to our evaluation results, HomDroid is capable of detecting 96.8% of covert malware while the False Negative Rates of another four state-of-the-art systems (PerDroid, Drebin, MaMaDroid, and IntDroid) are 30.7%, 16.3%, 15.2%, and 10.4%, respectively.","PeriodicalId":188008,"journal":{"name":"Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124524546","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}