"New Ranking Formulas to Improve Spectrum Based Fault Localization Via Systematic Search"
Pub Date: 2022-04-01 | DOI: 10.1109/ICSTW55395.2022.00059
Q. Sarhan, T. Gergely, Árpád Beszédes
In Spectrum-Based Fault Localization (SBFL), when some failing test cases indicate a bug, a suspiciousness score for each program element (e.g., statement, method, or class) is calculated using a risk evaluation formula based on basic statistics extracted from test coverage and test results (e.g., whether a program element is covered or not covered by a passing or failing test). The elements are then ranked from most to least suspicious based on their scores. The highest-ranked elements are believed to have the highest probability of being faulty; thus, this lightweight automated technique helps developers find the bug earlier. Several SBFL formulas have been proposed in the literature, but the number of possible formulas is infinite. Previous experiments have automatically searched for new formulas (e.g., using genetic algorithms), but no systematic search for new formulas has been reported in the literature. In this paper, we perform such a search by examining existing formulas, defining formula structure templates, generating formulas automatically (including already proposed ones), and comparing them to each other. Experiments to evaluate the generated formulas were conducted on Defects4J.
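As an illustration of how such risk evaluation formulas consume the spectrum statistics, the sketch below scores and ranks program elements with two classic formulas (Tarantula and Ochiai). These are not the formulas generated in the paper, only well-known examples, and the element names and coverage counts are hypothetical.

```python
import math

# Hypothetical spectrum: for each element, (ef, ep, nf, np) =
# (failing tests covering it, passing tests covering it,
#  failing tests not covering it, passing tests not covering it).
spectrum = {
    "Foo.java:42": (3, 1, 0, 6),
    "Foo.java:57": (1, 4, 2, 3),
    "Bar.java:10": (0, 5, 3, 2),
}

def tarantula(ef, ep, nf, np):
    fail_ratio = ef / (ef + nf) if ef + nf else 0.0
    pass_ratio = ep / (ep + np) if ep + np else 0.0
    total = fail_ratio + pass_ratio
    return fail_ratio / total if total else 0.0

def ochiai(ef, ep, nf, np):
    denom = math.sqrt((ef + nf) * (ef + ep))
    return ef / denom if denom else 0.0

# Rank elements from most to least suspicious using Ochiai.
ranking = sorted(spectrum, key=lambda e: ochiai(*spectrum[e]), reverse=True)
for element in ranking:
    print(element,
          "ochiai=%.3f" % ochiai(*spectrum[element]),
          "tarantula=%.3f" % tarantula(*spectrum[element]))
```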
{"title":"New Ranking Formulas to Improve Spectrum Based Fault Localization Via Systematic Search","authors":"Q. Sarhan, T. Gergely, Árpád Beszédes","doi":"10.1109/ICSTW55395.2022.00059","DOIUrl":"https://doi.org/10.1109/ICSTW55395.2022.00059","url":null,"abstract":"In Spectrum-Based Fault Localization (SBFL), when some failing test cases indicate a bug, a suspicion score for each program element (e.g., statement, method, or class) is calculated using a risk evaluation formula based on basic statistics (e.g., covering/not covering program element in passing/failing test) extracted from test coverage and test results. The elements are then ranked from most suspicious to least suspicious based on their scores. The elements with the highest rank are believed to have the highest probability of being faulty, thus, this light-weight automated technique aids developers to find the bug earlier. Several SBFL formulas were proposed in the literature, but the number of possible formulas is infinite. Previously, experiments were conducted to automatically search new formulas (e.g., using genetic algorithms). However, no systematic search for new formulas were reported in the literature. In this paper, we do so by examining existing formulas, defining formula structure templates, generating formulas automatically (including already proposed ones), and comparing them to each other. Experiments to evaluate the generated formulas were conducted on Defects4J.","PeriodicalId":147133,"journal":{"name":"2022 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW)","volume":"114 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115721257","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Choosing a Test Automation Framework for Programmable Logic Controllers in CODESYS Development Environment"
Pub Date: 2022-04-01 | DOI: 10.1109/ICSTW55395.2022.00055
Mikael Ebrahimi Salari, Eduard Paul Enoiu, W. Afzal, C. Seceleanu
Programmable Logic Controllers are computer devices often used in industrial control systems as primary components that provide operational control and monitoring. The software running on these controllers is usually programmed in an Integrated Development Environment using a graphical or textual language defined in the IEC 61131-3 standard. Although engineers have traditionally tested programmable logic controllers’ software manually, test automation is being adopted during development in various compliant development environments. However, recent studies indicate that choosing a suitable test automation framework is not trivial and hinders industrial applicability. In this paper, we tackle the problem of choosing a test automation framework for testing programmable logic controllers by focusing on the COntroller DEvelopment SYStem (CODESYS) development environment. CODESYS is considered a popular environment for device-independent programming according to IEC 61131-3. We explore the CODESYS-supported test automation frameworks through a grey literature review and identify the essential criteria for choosing such a framework. We validate these criteria with an industry practitioner and compare the resulting test automation frameworks in an industrial case study. Next, we summarize the steps for selecting a test automation framework and identify 29 different criteria for evaluating such frameworks. This study shows that CODESYS Test Manager and CoUnit are mentioned the most in the grey literature review results. The industrial case study aims to increase know-how in the automated testing of programmable logic controllers and to help other researchers and practitioners identify the right framework for test automation in an industrial context.
{"title":"Choosing a Test Automation Framework for Programmable Logic Controllers in CODESYS Development Environment","authors":"Mikael Ebrahimi Salari, Eduard Paul Enoiu, W. Afzal, C. Seceleanu","doi":"10.1109/ICSTW55395.2022.00055","DOIUrl":"https://doi.org/10.1109/ICSTW55395.2022.00055","url":null,"abstract":"Programmable Logic Controllers are computer devices often used in industrial control systems as primary components that provide operational control and monitoring. The software running on these controllers is usually programmed in an Integrated Development Environment using a graphical or textual language defined in the IEC 61131-3 standard. Although traditionally, engineers have tested programmable logic controllers’ software manually, test automation is being adopted during development in various compliant development environments. However, recent studies indicate that choosing a suitable test automation framework is not trivial and hinders industrial applicability. In this paper, we tackle the problem of choosing a test automation framework for testing programmable logic controllers, by focusing on the COntroller DEvelopment SYStem (CODESYS) development environment. CODESYS is deemed popular for device-independent programming according to IEC 61131-3. We explore the CODESYS-supported test automation frameworks through a grey literature review and identify the essential criteria for choosing such a test automation framework. We validate these criteria with an industry practitioner and compare the resulting test automation frameworks in an industrial case study. Next, we summarize the steps for selecting a test automation framework and the identification of 29 different criteria for test automation framework evaluation. This study shows that CODESYS Test Manager and CoUnit are mentioned the most in the grey literature review results. The industrial case study aims to increase the know-how in automated testing of programmable logic controllers and help other researchers and practitioners identify the right framework for test automation in an industrial context.","PeriodicalId":147133,"journal":{"name":"2022 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW)","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124282706","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Augmenting Equivalent Mutant Dataset Using Symbolic Execution"
Pub Date: 2022-04-01 | DOI: 10.1109/ICSTW55395.2022.00038
Seungjoon Chung, S. Yoo
Mutation testing aims to ensure that a test suite is capable of detecting real faults by checking whether it can reveal (i.e., kill) small and arbitrary lexical changes made to the program (i.e., mutants). Some of these arbitrary changes may result in a mutant that is syntactically different from, but semantically equivalent to, the original program under test: such mutants are called equivalent mutants. Since program equivalence is undecidable in general, equivalent mutants pose a serious challenge to mutation testing. Given an unkilled mutant, it is not possible to automatically decide whether the cause is the weakness of the test cases or the equivalence of the mutant. Recently, machine learning has been adopted to train binary classification models for mutant equivalence. However, training such classification models requires a pool of equivalent mutants, the labelling of which involves a significant amount of human investigation. In this paper, we introduce two techniques that can be used to augment equivalent mutant benchmarks. First, we propose a symbolic execution-based validation of mutant equivalence instead of manual classification. Second, we introduce a synthesis technique for equivalent mutants: for a subset of mutation operators, the technique identifies potential mutation locations that are guaranteed to produce equivalent mutants. We compare these two techniques to MutantBench, a manually labelled equivalent mutant benchmark. For the 19 programs studied, MutantBench contains 462 equivalent mutants, whereas our technique is capable of generating 1,725 equivalent mutants automatically, of which 1,349 are new and unique. We further show that the additional equivalent mutants can lead to more accurate equivalent mutant classification models.
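To illustrate the idea of validating mutant equivalence with symbolic reasoning, the sketch below asks the Z3 solver whether an original expression and its mutant can ever disagree on any input. The paper uses symbolic execution over programs; this single-expression, z3-solver-based check and the expressions themselves are simplified, illustrative assumptions, not the authors' implementation.

```python
# pip install z3-solver
from z3 import BitVec, Solver, unsat

x = BitVec("x", 32)

# Original expression and an "arithmetic operator replacement" mutant.
original = x * 2
mutant = x + x          # semantically equivalent for 32-bit integers
# mutant = x - 2        # a non-equivalent mutant, for comparison

solver = Solver()
solver.add(original != mutant)   # search for an input on which they differ

if solver.check() == unsat:
    print("No distinguishing input exists: the mutant is equivalent.")
else:
    print("Killable mutant; counterexample:", solver.model())
```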
{"title":"Augmenting Equivalent Mutant Dataset Using Symbolic Execution","authors":"Seungjoon Chung, S. Yoo","doi":"10.1109/ICSTW55395.2022.00038","DOIUrl":"https://doi.org/10.1109/ICSTW55395.2022.00038","url":null,"abstract":"Mutation testing aims to ensure that a test suite is capable of detecting real faults, by checking whether they can reveal (i.e., kill) small and arbitrary lexical changes made to the program (i.e., mutants). Some of these arbitrary changes may result in a mutant that is syntactically different but is semantically equivalent to the original program under test: such mutants are called equivalent mutants. Since program equivalence is undecidable in general, equivalent mutants pose a serious challenge to mutation testing. Given an unkilled mutant, it is not possible to automatically decide whether the cause is the weakness of test cases or the equivalence of the mutant. Recently machine learning has been adopted to train binary classification models for mutant equivalence. However, training such classification models requires a pool of equivalent mutants, the labelling for which involves a significant amount of human investigation. In this paper, we introduce two techniques that can be used to augment the equivalent mutant benchmarks. First, we propose a symbolic execution-based validation of mutant equivalence, instead of manual classification. Second, we introduce a synthesis technique for equivalent mutants: for a subset of mutation operators, the technique identifies potential mutation locations that are guaranteed to produce equivalent mutants. We compare these two techniques to MutantBench, a manually labelled equivalent mutant benchmark. For the 19 programs studied, MutantBench contains 462 equivalent mutants, whereas our technique is capable of generating 1,725 equivalent mutants automatically, of which 1,349 are new and unique. We further show that the additional equivalent mutants can lead to more accurate equivalent mutant classification models.","PeriodicalId":147133,"journal":{"name":"2022 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126821066","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Prioritized Variable-length Test Cases Generation for Finite State Machines"
Pub Date: 2022-03-17 | DOI: 10.1109/ICSTW55395.2022.00017
Vaclav Rechtberger, Miroslav Bures, Bestoun S. Ahmed, Y. Belkhier, Jiří Néma, H. Schvach
Model-based Testing (MBT) is an effective approach for testing when parts of a system-under-test have the characteristics of a finite state machine (FSM). Despite the various strategies in the literature on this topic, little work handles special testing situations in which, concurrently, (1) the test paths can start and end only in defined states of the FSM, (2) a prioritization mechanism requires only defined states and transitions of the FSM to be visited by test cases, and (3) the test paths must lie within a given length range, not necessarily of explicit uniform length. This paper presents a test generation strategy that satisfies all these requirements. A concurrent combination of these requirements is highly practical for real industrial testing. Six variants of possible algorithms to implement this strategy are described. Using a mixture of 180 problem instances from real automotive and defense projects and artificially generated FSMs, all variants are compared with a baseline strategy based on a modification of the established N-switch coverage concept. Various properties of the generated test paths and their potential to activate fictional defects defined in the FSMs are evaluated. The presented strategy outperforms the baseline in most problem configurations. Of the six analyzed variants, three give the best results, even though a universal best performer is hard to identify. Depending on the application of the FSM, the strategy and evaluation presented in this paper are applicable to testing both functional and non-functional software requirements.
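A minimal sketch of the core path-generation constraints (allowed start/end states and a transition-count range over a directed FSM graph) is shown below. It is plain depth-first enumeration, not any of the six algorithm variants or the prioritization mechanism evaluated in the paper, and the example FSM is made up.

```python
# Hypothetical FSM as an adjacency list of directed transitions.
fsm = {
    "S0": ["S1", "S2"],
    "S1": ["S2", "S3"],
    "S2": ["S3"],
    "S3": ["S0"],
}
allowed_starts = {"S0"}
allowed_ends = {"S3"}
min_len, max_len = 2, 4          # length measured in transitions

def test_paths(fsm, starts, ends, min_len, max_len):
    """Enumerate paths that start/end in allowed states within the length range."""
    results = []

    def dfs(path):
        if len(path) - 1 > max_len:          # too many transitions, prune
            return
        if path[-1] in ends and min_len <= len(path) - 1 <= max_len:
            results.append(list(path))
        for nxt in fsm.get(path[-1], []):
            dfs(path + [nxt])

    for start in starts:
        dfs([start])
    return results

for p in test_paths(fsm, allowed_starts, allowed_ends, min_len, max_len):
    print(" -> ".join(p))
```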
{"title":"Prioritized Variable-length Test Cases Generation for Finite State Machines","authors":"Vaclav Rechtberger, Miroslav Bures, Bestoun S. Ahmed, Y. Belkhier, Jiří Néma, H. Schvach","doi":"10.1109/ICSTW55395.2022.00017","DOIUrl":"https://doi.org/10.1109/ICSTW55395.2022.00017","url":null,"abstract":"Model-based Testing (MBT) is an effective approach for testing when parts of a system-under-test have the characteristics of a finite state machine (FSM). Despite various strategies in the literature on this topic, little work exists to handle special testing situations. More specifically, when concurrently: (1) the test paths can start and end only in defined states of the FSM, (2) a prioritization mechanism that requires only defined states and transitions of the FSM to be visited by test cases is required, and (3) the test paths must be in a given length range, not necessarily of explicit uniform length. This paper presents a test generation strategy that satisfies all these requirements. A concurrent combination of these requirements is highly practical for real industrial testing. Six variants of possible algorithms to implement this strategy are described. Using a mixture of 180 problem instances from real automotive and defense projects and artificially generated FSMs, all variants are compared with a baseline strategy based on an established N-switch coverage concept modification. Various properties of the generated test paths and their potential to activate fictional defects defined in FSMs are evaluated. The presented strategy outperforms the baseline in most problem configurations. Out of the six analyzed variants, three give the best results even though a universal best performer is hard to identify. Depending on the application of the FSM, the strategy and evaluation presented in this paper are applicable both in testing functional and non-functional software requirements.","PeriodicalId":147133,"journal":{"name":"2022 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132883389","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Overview of Test Coverage Criteria for Test Case Generation from Finite State Machines Modelled as Directed Graphs"
Pub Date: 2022-03-17 | DOI: 10.1109/ICSTW55395.2022.00044
Vaclav Rechtberger, Miroslav Bures, Bestoun S. Ahmed
Test Coverage criteria are an essential concept for test engineers when generating test cases from a System Under Test model. They are routinely used in test case generation for user interfaces, middleware, and back-end system parts of software, electronics, or Internet of Things (IoT) systems. Test Coverage criteria define the number of actions or combinations by which a system is tested, informally determining the potential "strength" of a test set. As no previous study has summarized all commonly used test coverage criteria for Finite State Machines and comprehensively discussed them regarding their subsumption, equivalence, or non-comparability, this paper provides such an overview. In this study, the 14 most common test coverage criteria for Finite State Machines defined via a directed graph, together with seven of their synonyms, are summarized and compared. The results give researchers and industry testing engineers a helpful overview when setting a test strategy for a software-based or IoT system.
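As an illustration of two of the surveyed criteria, the sketch below checks whether a set of test paths achieves all-transitions (0-switch) coverage and measures transition-pair (1-switch) coverage on an FSM modelled as a directed graph. The FSM and test paths are hypothetical, and the check is a simplification of the definitions compared in the paper.

```python
# Hypothetical FSM transitions (directed edges) and test paths over its states.
fsm_edges = {("S0", "S1"), ("S1", "S2"), ("S2", "S0"), ("S1", "S0")}
test_paths = [["S0", "S1", "S2", "S0"], ["S0", "S1", "S0"]]

def transitions(path):
    """Consecutive edges traversed by a test path."""
    return list(zip(path, path[1:]))

def transition_pairs(path):
    """Pairs of adjacent transitions traversed by a test path."""
    t = transitions(path)
    return list(zip(t, t[1:]))

covered_edges = {e for p in test_paths for e in transitions(p)}
covered_pairs = {pp for p in test_paths for pp in transition_pairs(p)}

# Required 1-switch pairs: adjacent transitions sharing their middle state.
required_pairs = {((a, b), (b2, c))
                  for (a, b) in fsm_edges for (b2, c) in fsm_edges if b == b2}

print("0-switch (all transitions) covered:", fsm_edges <= covered_edges)
print("1-switch pairs covered: %d / %d"
      % (len(covered_pairs & required_pairs), len(required_pairs)))
```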
{"title":"Overview of Test Coverage Criteria for Test Case Generation from Finite State Machines Modelled as Directed Graphs","authors":"Vaclav Rechtberger, Miroslav Bures, Bestoun S. Ahmed","doi":"10.1109/ICSTW55395.2022.00044","DOIUrl":"https://doi.org/10.1109/ICSTW55395.2022.00044","url":null,"abstract":"Test Coverage criteria are an essential concept for test engineers when generating the test cases from a System Under Test model. They are routinely used in test case generation for user interfaces, middleware, and back-end system parts for software, electronics, or Internet of Things (IoT) systems. Test Coverage criteria define the number of actions or combinations by which a system is tested, informally determining a potential \"strength\" of a test set. As no previous study summarized all commonly used test coverage criteria for Finite State Machines and comprehensively discussed them regarding their subsumption, equivalence, or non-comparability, this paper provides this overview. In this study, 14 most common test coverage criteria and seven of their synonyms for Finite State Machines defined via a directed graph are summarized and compared. The results give researchers and industry testing engineers a helpful overview when setting a software-based or IoT system test strategy.","PeriodicalId":147133,"journal":{"name":"2022 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125525998","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"µBert: Mutation Testing using Pre-Trained Language Models"
Pub Date: 2022-03-07 | DOI: 10.1109/ICSTW55395.2022.00039
Renzo Degiovanni, Mike Papadakis
We introduce µBert, a mutation testing tool that uses a pre-trained language model (CodeBERT) to generate mutants. This is done by masking a token from the expression given as input and using CodeBERT to predict it. Thus, the mutants are generated by replacing the masked tokens with the predicted ones. We evaluate µBert on 40 real faults from Defects4J and show that it can detect 27 out of the 40 faults, while the baseline (PiTest) detects 26 of them. We also show that µBert can be two times more cost-effective than PiTest when the same number of mutants is analysed. Additionally, we evaluate the impact of µBert’s mutants when used by program assertion inference techniques, and show that they can help in producing better specifications. Finally, we discuss the quality and naturalness of some interesting mutants produced by µBert during our experimental evaluation.
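A minimal sketch of the mask-and-predict step is shown below, assuming the publicly available microsoft/codebert-base-mlm checkpoint via Hugging Face Transformers. This is a plausible setup, not µBert's actual pipeline, and the masked Java snippet is invented; each predicted token that differs from the original one would yield a mutant.

```python
# pip install transformers torch
from transformers import pipeline

# CodeBERT with a masked-language-model head; RoBERTa-style models use "<mask>".
fill_mask = pipeline("fill-mask", model="microsoft/codebert-base-mlm")

# Mask one token of a (hypothetical) Java expression and let the model propose replacements.
masked_code = "if (index <mask> list.size()) { return list.get(index); }"

for prediction in fill_mask(masked_code, top_k=5):
    # Candidate tokens that differ from the original operator become mutants.
    print(prediction["token_str"].strip(), round(prediction["score"], 3))
```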
{"title":"µBert: Mutation Testing using Pre-Trained Language Models","authors":"Renzo Degiovanni, Mike Papadakis","doi":"10.1109/ICSTW55395.2022.00039","DOIUrl":"https://doi.org/10.1109/ICSTW55395.2022.00039","url":null,"abstract":"We introduce µBert, a mutation testing tool that uses a pre-trained language model (CodeBERT) to generate mutants. This is done by masking a token from the expression given as input and using CodeBERT to predict it. Thus, the mutants are generated by replacing the masked tokens with the predicted ones. We evaluate µBert on 40 real faults from Defects4J and show that it can detect 27 out of the 40 faults, while the baseline (PiTest) detects 26 of them. We also show that µBert can be 2 times more cost-effective than PiTest, when the same number of mutants are analysed. Additionally, we evaluate the impact of µBert’s mutants when used by program assertion inference techniques, and show that they can help in producing better specifications. Finally, we discuss about the quality and naturalness of some interesting mutants produced by µBert during our experimental evaluation.","PeriodicalId":147133,"journal":{"name":"2022 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131216324","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Testing Deep Learning Models: A First Comparative Study of Multiple Testing Techniques"
Pub Date: 2022-02-24 | DOI: 10.1109/ICSTW55395.2022.00035
M. K. Ahuja, A. Gotlieb, Helge Spieker
Deep Learning (DL) has revolutionized the capabilities of vision-based systems (VBS) in critical applications such as autonomous driving, robotic surgery, critical infrastructure surveillance, and air and maritime traffic control. By analyzing images, voice, videos, or any type of complex signal, DL has considerably increased the situation awareness of these systems. At the same time, as VBS rely more and more on trained DL models, their reliability and robustness have been challenged, and it has become crucial to test these models thoroughly to assess their capabilities and potential errors. To discover faults in DL models, existing software testing methods have been adapted and refined accordingly. In this article, we provide an overview of these software testing methods, namely differential, metamorphic, mutation, and combinatorial testing, as well as adversarial perturbation testing, and review some challenges in their deployment for boosting perception systems used in VBS. We also provide a first experimental comparative study on a classical benchmark used in VBS and discuss its results.
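As a small illustration of one of the surveyed techniques, metamorphic testing, the sketch below checks that a classifier's prediction does not change under a mild brightness shift of the input image. The model here is a stand-in stub and the relation, threshold, and data are illustrative assumptions, not the benchmark models or relations used in the study.

```python
import numpy as np

def model_predict(image):
    """Stand-in for a trained image classifier: returns a class label."""
    # A real test would call the deep model under test here.
    return int(image.mean() > 0.4)

def metamorphic_brightness_test(image, delta=0.05):
    """Metamorphic relation: a small brightness change should not flip the prediction."""
    original = model_predict(image)
    follow_up = model_predict(np.clip(image + delta, 0.0, 1.0))
    return original == follow_up

rng = np.random.default_rng(0)
images = rng.random((10, 32, 32, 3))      # hypothetical normalized RGB inputs
failures = [i for i, img in enumerate(images) if not metamorphic_brightness_test(img)]
print("Metamorphic violations:", failures)
```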
{"title":"Testing Deep Learning Models: A First Comparative Study of Multiple Testing Techniques","authors":"M. K. Ahuja, A. Gotlieb, Helge Spieker","doi":"10.1109/ICSTW55395.2022.00035","DOIUrl":"https://doi.org/10.1109/ICSTW55395.2022.00035","url":null,"abstract":"Deep Learning (DL) has revolutionized the capabilities of vision-based systems (VBS) in critical applications such as autonomous driving, robotic surgery, critical infrastructure surveillance, air and maritime traffic control, etc. By analyzing images, voice, videos, or any type of complex signals, DL has considerably increased the situation awareness of these systems. At the same time, while relying more and more on trained DL models, the reliability and robustness of VBS have been challenged and it has become crucial to test thoroughly these models to assess their capabilities and potential errors. To discover faults in DL models, existing software testing methods have been adapted and refined accordingly. In this article, we provide an overview of these software testing methods, namely differential, metamorphic, mutation, and combinatorial testing, as well as adversarial perturbation testing and review some challenges in their deployment for boosting perception systems used in VBS. We also provide a first experimental comparative study on a classical benchmark used in VBS and discuss its results.","PeriodicalId":147133,"journal":{"name":"2022 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133946264","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Pinpointing Anomaly Events in Logs from Stability Testing – N-Grams vs. Deep-Learning"
Pub Date: 2022-02-18 | DOI: 10.1109/ICSTW55395.2022.00056
M. Mäntylä, M. Varela, Shayan Hashemi
As stability testing execution logs can be very long, software engineers need help in locating anomalous events. We develop and evaluate two models for scoring individual log events for anomalousness, namely an N-Gram model and a Deep Learning model with LSTM (Long Short-Term Memory). Both are trained on normal log sequences only. We evaluate the models with long log sequences from Android stability testing in our company case and with short log sequences from the public HDFS (Hadoop Distributed File System) dataset. We evaluate next-event prediction accuracy and computational efficiency. The LSTM model is more accurate on the stability testing logs (0.848 vs 0.865), whereas on the HDFS logs the N-Gram model is slightly more accurate (0.904 vs 0.900). The N-Gram model has far superior computational efficiency compared to the deep model (4 to 13 seconds vs 16 minutes to nearly 4 hours), making it the preferred choice for our case company. Scoring individual log events for anomalousness seems like a good aid for root cause analysis of failing test cases, and our case company plans to add it to its online services. Despite the recent surge in using deep learning for software system anomaly detection, we found limited benefits in doing so. However, future work should consider whether our finding holds with different LSTM-model hyper-parameters, other datasets, and other deep-learning approaches that promise better accuracy and computational efficiency than LSTM-based models.
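A minimal sketch of the N-gram side of the comparison: train on normal log-event sequences only, then score each event by the estimated probability of seeing it after the preceding n-1 events, so that low-probability events are flagged as anomalous. The window size, add-alpha smoothing, and example event sequences are illustrative assumptions, not the paper's exact model.

```python
from collections import Counter, defaultdict

N = 3  # trigram model over log-event identifiers

def train(sequences):
    """Count (context -> next event) occurrences on normal sequences only."""
    counts = defaultdict(Counter)
    for seq in sequences:
        padded = ["<s>"] * (N - 1) + seq
        for i in range(N - 1, len(padded)):
            context = tuple(padded[i - N + 1:i])
            counts[context][padded[i]] += 1
    return counts

def score_events(counts, seq, vocab_size, alpha=1.0):
    """Per-event probability with add-alpha smoothing; low values flag anomalies."""
    padded = ["<s>"] * (N - 1) + seq
    scores = []
    for i in range(N - 1, len(padded)):
        context = tuple(padded[i - N + 1:i])
        ctx_counts = counts[context]
        prob = (ctx_counts[padded[i]] + alpha) / (sum(ctx_counts.values()) + alpha * vocab_size)
        scores.append((padded[i], prob))
    return scores

normal_sequences = [["boot", "connect", "send", "ack", "close"]] * 50
counts = train(normal_sequences)
test_sequence = ["boot", "connect", "send", "crash", "close"]
for event, prob in score_events(counts, test_sequence, vocab_size=6):
    print(f"{event}: {prob:.3f}")
```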
{"title":"Pinpointing Anomaly Events in Logs from Stability Testing – N-Grams vs. Deep-Learning","authors":"M. Mäntylä, M. Varela, Shayan Hashemi","doi":"10.1109/ICSTW55395.2022.00056","DOIUrl":"https://doi.org/10.1109/ICSTW55395.2022.00056","url":null,"abstract":"As stability testing execution logs can be very long, software engineers need help in locating anomalous events. We develop and evaluate two models for scoring individual log-events for anomalousness, namely an N-Gram model and a Deep Learning model with LSTM (Long short-term memory). Both are trained on normal log sequences only. We evaluate the models with long log sequences of Android stability testing in our company case and with short log sequences from HDFS (Hadoop Distributed File System) public dataset. We evaluate next event prediction accuracy and computational efficiency. The LSTM model is more accurate in stability testing logs (0.848 vs 0.865), whereas in HDFS logs the N-Gram is slightly more accurate (0.904 vs 0.900). The N-Gram model has far superior computational efficiency compared to the Deep model (4 to 13 seconds vs 16 minutes to nearly 4 hours), making it the preferred choice for our case company. Scoring individual log events for anomalousness seems like a good aid for root cause analysis of failing test cases, and our case company plans to add it to its online services. Despite the recent surge in using deep learning in software system anomaly detection, we found limited benefits in doing so. However, future work should consider whether our finding holds with different LSTM-model hyper-parameters, other datasets, and with other deep-learning approaches that promise better accuracy and computational efficiency than LSTM based models.","PeriodicalId":147133,"journal":{"name":"2022 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129603390","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Systematic Training and Testing for Machine Learning Using Combinatorial Interaction Testing"
Pub Date: 2022-01-28 | DOI: 10.1109/ICSTW55395.2022.00031
Tyler Cody, Erin Lanus, Daniel D. Doyle, Laura J. Freeman
This paper demonstrates the systematic use of combinatorial coverage for selecting and characterizing test and training sets for machine learning models. The presented work adapts combinatorial interaction testing, which has been successfully leveraged in identifying faults in software testing, to characterize data used in machine learning. The MNIST handwritten-digits data is used to demonstrate that combinatorial coverage can be used to select test sets that stress machine learning model performance, to select training sets that lead to robust model performance, and to select data for fine-tuning models to new domains. Thus, the results posit combinatorial coverage as a holistic approach to training and testing for machine learning. In contrast to prior work, which has focused on coverage of the internals of neural networks, this paper considers coverage over simple features derived from inputs and outputs. Thus, this paper addresses the case where the supplier of test and training sets for machine learning models does not have intellectual property rights to the models themselves. Finally, the paper addresses prior criticism of combinatorial coverage and provides a rebuttal that advocates the use of coverage metrics in machine learning applications.
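To illustrate the underlying measure, the sketch below computes t-way combinatorial coverage of a dataset over a few discretized features: the fraction of all possible t-way feature-value combinations that actually appear in the data. The features, value domains, and t=2 setting are hypothetical, not the paper's experimental configuration.

```python
from itertools import combinations, product

# Hypothetical discretized features derived from inputs/outputs.
domains = {
    "brightness": ["low", "mid", "high"],
    "thickness": ["thin", "thick"],
    "label_parity": ["even", "odd"],
}
dataset = [
    {"brightness": "low", "thickness": "thin", "label_parity": "even"},
    {"brightness": "mid", "thickness": "thick", "label_parity": "odd"},
    {"brightness": "high", "thickness": "thin", "label_parity": "odd"},
]

def t_way_coverage(dataset, domains, t=2):
    """Fraction of possible t-way feature-value combinations present in the data."""
    covered = total = 0
    for feature_subset in combinations(sorted(domains), t):
        possible = set(product(*(domains[f] for f in feature_subset)))
        observed = {tuple(row[f] for f in feature_subset) for row in dataset}
        covered += len(possible & observed)
        total += len(possible)
    return covered / total

print(f"2-way combinatorial coverage: {t_way_coverage(dataset, domains):.2f}")
```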
{"title":"Systematic Training and Testing for Machine Learning Using Combinatorial Interaction Testing","authors":"Tyler Cody, Erin Lanus, Daniel D. Doyle, Laura J. Freeman","doi":"10.1109/ICSTW55395.2022.00031","DOIUrl":"https://doi.org/10.1109/ICSTW55395.2022.00031","url":null,"abstract":"This paper demonstrates the systematic use of combinatorial coverage for selecting and characterizing test and training sets for machine learning models. The presented work adapts combinatorial interaction testing, which has been successfully leveraged in identifying faults in software testing, to characterize data used in machine learning. The MNIST hand-written digits data is used to demonstrate that combinatorial coverage can be used to select test sets that stress machine learning model performance, to select training sets that lead to robust model performance, and to select data for fine-tuning models to new domains. Thus, the results posit combinatorial coverage as a holistic approach to training and testing for machine learning. In contrast to prior work which has focused on the use of coverage in regard to the internal of neural networks, this paper considers coverage over simple features derived from inputs and outputs. Thus, this paper addresses the case where the supplier of test and training sets for machine learning models does not have intellectual property rights to the models themselves. Finally, the paper addresses prior criticism of combinatorial coverage and provides a rebuttal which advocates the use of coverage metrics in machine learning applications.","PeriodicalId":147133,"journal":{"name":"2022 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW)","volume":"46 9","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114105167","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Early Detection of Network Attacks Using Deep Learning"
Pub Date: 2022-01-27 | DOI: 10.1109/ICSTW55395.2022.00020
Tanwir Ahmad, D. Truscan, Juri Vain, Ivan Porres
The Internet has become a prime target of security attacks and intrusions. These attacks can lead to system malfunction, network breakdown, data corruption, or theft. A network intrusion detection system (IDS) is a tool used for identifying unauthorized and malicious behavior by observing the network traffic. State-of-the-art intrusion detection systems are designed to detect an attack by inspecting complete information about the attack. This means that an IDS can only detect an attack after it has been executed on the system under attack and might already have caused damage. In this paper, we propose an end-to-end early intrusion detection system to prevent network attacks before they can cause further damage to the system under attack, thereby preventing unforeseen downtime and interruption. We employ a deep neural network-based classifier for attack identification. The network is trained in a supervised manner to extract relevant features from raw network traffic data instead of relying on the manual feature selection process used in most related approaches. Further, we introduce a new metric, called earliness, to evaluate how early our proposed approach detects attacks. We have empirically evaluated our approach on the CICIDS2017 dataset. The results show that our approach performed well and attained an overall balanced accuracy of 0.803.
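A minimal sketch of the general idea of supervised classification over raw traffic prefixes is given below: a small 1D-CNN that labels a flow as benign or attack from only its first bytes. The architecture, prefix length, and random stand-in data are assumptions for illustration, not the authors' network or the CICIDS2017 preprocessing.

```python
# pip install torch
import torch
import torch.nn as nn

class EarlyFlowClassifier(nn.Module):
    """Classifies a traffic flow from only its first `prefix_len` bytes."""
    def __init__(self, prefix_len=200, n_classes=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(32, n_classes),
        )

    def forward(self, x):            # x: (batch, prefix_len) byte values in [0, 255]
        return self.net(x.unsqueeze(1) / 255.0)

# Hypothetical training step on random stand-in data (real inputs would be
# byte prefixes of network flows labelled benign/attack).
model = EarlyFlowClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

flows = torch.randint(0, 256, (64, 200)).float()
labels = torch.randint(0, 2, (64,))
loss = criterion(model(flows), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print("example training loss:", loss.item())
```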
{"title":"Early Detection of Network Attacks Using Deep Learning","authors":"Tanwir Ahmad, D. Truscan, Juri Vain, Ivan Porres","doi":"10.1109/ICSTW55395.2022.00020","DOIUrl":"https://doi.org/10.1109/ICSTW55395.2022.00020","url":null,"abstract":"The Internet has become a prime subject to security attacks and intrusions by attackers. These attacks can lead to system malfunction, network breakdown, data corruption or theft. A network intrusion detection system (IDS) is a tool used for identifying unauthorized and malicious behavior by observing the network traffic. State-of-the-art intrusion detection systems are designed to detect an attack by inspecting the complete information about the attack. This means that an IDS would only be able to detect an attack after it has been executed on the system under attack and might have caused damage to the system. In this paper, we propose an end-to-end early intrusion detection system to prevent network attacks before they could cause any more damage to the system under attack while preventing unforeseen downtime and interruption. We employ a deep neural network-based classifier for attack identification. The network is trained in a supervised manner to extract relevant features from raw network traffic data instead of relying on a manual feature selection process used in most related approaches. Further, we introduce a new metric, called earliness, to evaluate how early our proposed approach detects attacks. We have empirically evaluated our approach on the CICIDS2017 dataset. The results show that our approach performed well and attained an overall 0.803 balanced accuracy.","PeriodicalId":147133,"journal":{"name":"2022 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW)","volume":"376 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129162421","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}