"New Ranking Formulas to Improve Spectrum Based Fault Localization Via Systematic Search"
Pub Date: 2022-04-01 | DOI: 10.1109/ICSTW55395.2022.00059
Q. Sarhan, T. Gergely, Árpád Beszédes
In Spectrum-Based Fault Localization (SBFL), when some failing test cases indicate a bug, a suspiciousness score for each program element (e.g., statement, method, or class) is calculated using a risk evaluation formula based on basic statistics extracted from test coverage and test results (e.g., whether a program element is covered or not covered by a passing or failing test). The elements are then ranked from most to least suspicious based on their scores. The highest-ranked elements are believed to have the highest probability of being faulty; thus, this lightweight automated technique helps developers find the bug earlier. Several SBFL formulas have been proposed in the literature, but the number of possible formulas is infinite. Previous experiments have automatically searched for new formulas (e.g., using genetic algorithms), but no systematic search for new formulas has been reported in the literature. In this paper, we perform such a search by examining existing formulas, defining formula structure templates, generating formulas automatically (including already proposed ones), and comparing them to each other. Experiments to evaluate the generated formulas were conducted on Defects4J.
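As an illustration of how such risk evaluation formulas consume the spectrum statistics, the sketch below scores and ranks program elements with two classic formulas (Tarantula and Ochiai). These are not the formulas generated in the paper, only well-known examples, and the element names and coverage counts are hypothetical.

```python
import math

# Hypothetical spectrum: for each element, (ef, ep, nf, np) =
# (failing tests covering it, passing tests covering it,
#  failing tests not covering it, passing tests not covering it).
spectrum = {
    "Foo.java:42": (3, 1, 0, 6),
    "Foo.java:57": (1, 4, 2, 3),
    "Bar.java:10": (0, 5, 3, 2),
}

def tarantula(ef, ep, nf, np):
    fail_ratio = ef / (ef + nf) if ef + nf else 0.0
    pass_ratio = ep / (ep + np) if ep + np else 0.0
    total = fail_ratio + pass_ratio
    return fail_ratio / total if total else 0.0

def ochiai(ef, ep, nf, np):
    denom = math.sqrt((ef + nf) * (ef + ep))
    return ef / denom if denom else 0.0

# Rank elements from most to least suspicious using Ochiai.
ranking = sorted(spectrum, key=lambda e: ochiai(*spectrum[e]), reverse=True)
for element in ranking:
    print(element,
          "ochiai=%.3f" % ochiai(*spectrum[element]),
          "tarantula=%.3f" % tarantula(*spectrum[element]))
```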
{"title":"New Ranking Formulas to Improve Spectrum Based Fault Localization Via Systematic Search","authors":"Q. Sarhan, T. Gergely, Árpád Beszédes","doi":"10.1109/ICSTW55395.2022.00059","DOIUrl":"https://doi.org/10.1109/ICSTW55395.2022.00059","url":null,"abstract":"In Spectrum-Based Fault Localization (SBFL), when some failing test cases indicate a bug, a suspicion score for each program element (e.g., statement, method, or class) is calculated using a risk evaluation formula based on basic statistics (e.g., covering/not covering program element in passing/failing test) extracted from test coverage and test results. The elements are then ranked from most suspicious to least suspicious based on their scores. The elements with the highest rank are believed to have the highest probability of being faulty, thus, this light-weight automated technique aids developers to find the bug earlier. Several SBFL formulas were proposed in the literature, but the number of possible formulas is infinite. Previously, experiments were conducted to automatically search new formulas (e.g., using genetic algorithms). However, no systematic search for new formulas were reported in the literature. In this paper, we do so by examining existing formulas, defining formula structure templates, generating formulas automatically (including already proposed ones), and comparing them to each other. Experiments to evaluate the generated formulas were conducted on Defects4J.","PeriodicalId":147133,"journal":{"name":"2022 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW)","volume":"114 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115721257","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Choosing a Test Automation Framework for Programmable Logic Controllers in CODESYS Development Environment"
Pub Date: 2022-04-01 | DOI: 10.1109/ICSTW55395.2022.00055
Mikael Ebrahimi Salari, Eduard Paul Enoiu, W. Afzal, C. Seceleanu
Programmable Logic Controllers are computer devices often used in industrial control systems as primary components that provide operational control and monitoring. The software running on these controllers is usually programmed in an Integrated Development Environment using a graphical or textual language defined in the IEC 61131-3 standard. Although engineers have traditionally tested programmable logic controllers’ software manually, test automation is being adopted during development in various compliant development environments. However, recent studies indicate that choosing a suitable test automation framework is not trivial and hinders industrial applicability. In this paper, we tackle the problem of choosing a test automation framework for testing programmable logic controllers by focusing on the COntroller DEvelopment SYStem (CODESYS) development environment. CODESYS is considered a popular environment for device-independent programming according to IEC 61131-3. We explore the CODESYS-supported test automation frameworks through a grey literature review and identify the essential criteria for choosing such a framework. We validate these criteria with an industry practitioner and compare the resulting test automation frameworks in an industrial case study. Next, we summarize the steps for selecting a test automation framework and identify 29 different criteria for evaluating such frameworks. This study shows that CODESYS Test Manager and CoUnit are mentioned the most in the grey literature review results. The industrial case study aims to increase know-how in the automated testing of programmable logic controllers and to help other researchers and practitioners identify the right framework for test automation in an industrial context.
{"title":"Choosing a Test Automation Framework for Programmable Logic Controllers in CODESYS Development Environment","authors":"Mikael Ebrahimi Salari, Eduard Paul Enoiu, W. Afzal, C. Seceleanu","doi":"10.1109/ICSTW55395.2022.00055","DOIUrl":"https://doi.org/10.1109/ICSTW55395.2022.00055","url":null,"abstract":"Programmable Logic Controllers are computer devices often used in industrial control systems as primary components that provide operational control and monitoring. The software running on these controllers is usually programmed in an Integrated Development Environment using a graphical or textual language defined in the IEC 61131-3 standard. Although traditionally, engineers have tested programmable logic controllers’ software manually, test automation is being adopted during development in various compliant development environments. However, recent studies indicate that choosing a suitable test automation framework is not trivial and hinders industrial applicability. In this paper, we tackle the problem of choosing a test automation framework for testing programmable logic controllers, by focusing on the COntroller DEvelopment SYStem (CODESYS) development environment. CODESYS is deemed popular for device-independent programming according to IEC 61131-3. We explore the CODESYS-supported test automation frameworks through a grey literature review and identify the essential criteria for choosing such a test automation framework. We validate these criteria with an industry practitioner and compare the resulting test automation frameworks in an industrial case study. Next, we summarize the steps for selecting a test automation framework and the identification of 29 different criteria for test automation framework evaluation. This study shows that CODESYS Test Manager and CoUnit are mentioned the most in the grey literature review results. The industrial case study aims to increase the know-how in automated testing of programmable logic controllers and help other researchers and practitioners identify the right framework for test automation in an industrial context.","PeriodicalId":147133,"journal":{"name":"2022 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW)","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124282706","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Augmenting Equivalent Mutant Dataset Using Symbolic Execution"
Pub Date: 2022-04-01 | DOI: 10.1109/ICSTW55395.2022.00038
Seungjoon Chung, S. Yoo
Mutation testing aims to ensure that a test suite is capable of detecting real faults by checking whether it can reveal (i.e., kill) small and arbitrary lexical changes made to the program (i.e., mutants). Some of these arbitrary changes may result in a mutant that is syntactically different from, but semantically equivalent to, the original program under test: such mutants are called equivalent mutants. Since program equivalence is undecidable in general, equivalent mutants pose a serious challenge to mutation testing. Given an unkilled mutant, it is not possible to automatically decide whether the cause is the weakness of the test cases or the equivalence of the mutant. Recently, machine learning has been adopted to train binary classification models for mutant equivalence. However, training such classification models requires a pool of equivalent mutants, the labelling of which involves a significant amount of human investigation. In this paper, we introduce two techniques that can be used to augment equivalent mutant benchmarks. First, we propose a symbolic execution-based validation of mutant equivalence instead of manual classification. Second, we introduce a synthesis technique for equivalent mutants: for a subset of mutation operators, the technique identifies potential mutation locations that are guaranteed to produce equivalent mutants. We compare these two techniques to MutantBench, a manually labelled equivalent mutant benchmark. For the 19 programs studied, MutantBench contains 462 equivalent mutants, whereas our technique is capable of generating 1,725 equivalent mutants automatically, of which 1,349 are new and unique. We further show that the additional equivalent mutants can lead to more accurate equivalent mutant classification models.
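To illustrate the idea of validating mutant equivalence with symbolic reasoning, the sketch below asks the Z3 solver whether an original expression and its mutant can ever disagree on any input. The paper uses symbolic execution over programs; this single-expression, z3-solver-based check and the expressions themselves are simplified, illustrative assumptions, not the authors' implementation.

```python
# pip install z3-solver
from z3 import BitVec, Solver, unsat

x = BitVec("x", 32)

# Original expression and an "arithmetic operator replacement" mutant.
original = x * 2
mutant = x + x          # semantically equivalent for 32-bit integers
# mutant = x - 2        # a non-equivalent mutant, for comparison

solver = Solver()
solver.add(original != mutant)   # search for an input on which they differ

if solver.check() == unsat:
    print("No distinguishing input exists: the mutant is equivalent.")
else:
    print("Killable mutant; counterexample:", solver.model())
```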
{"title":"Augmenting Equivalent Mutant Dataset Using Symbolic Execution","authors":"Seungjoon Chung, S. Yoo","doi":"10.1109/ICSTW55395.2022.00038","DOIUrl":"https://doi.org/10.1109/ICSTW55395.2022.00038","url":null,"abstract":"Mutation testing aims to ensure that a test suite is capable of detecting real faults, by checking whether they can reveal (i.e., kill) small and arbitrary lexical changes made to the program (i.e., mutants). Some of these arbitrary changes may result in a mutant that is syntactically different but is semantically equivalent to the original program under test: such mutants are called equivalent mutants. Since program equivalence is undecidable in general, equivalent mutants pose a serious challenge to mutation testing. Given an unkilled mutant, it is not possible to automatically decide whether the cause is the weakness of test cases or the equivalence of the mutant. Recently machine learning has been adopted to train binary classification models for mutant equivalence. However, training such classification models requires a pool of equivalent mutants, the labelling for which involves a significant amount of human investigation. In this paper, we introduce two techniques that can be used to augment the equivalent mutant benchmarks. First, we propose a symbolic execution-based validation of mutant equivalence, instead of manual classification. Second, we introduce a synthesis technique for equivalent mutants: for a subset of mutation operators, the technique identifies potential mutation locations that are guaranteed to produce equivalent mutants. We compare these two techniques to MutantBench, a manually labelled equivalent mutant benchmark. For the 19 programs studied, MutantBench contains 462 equivalent mutants, whereas our technique is capable of generating 1,725 equivalent mutants automatically, of which 1,349 are new and unique. We further show that the additional equivalent mutants can lead to more accurate equivalent mutant classification models.","PeriodicalId":147133,"journal":{"name":"2022 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126821066","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Prioritized Variable-length Test Cases Generation for Finite State Machines"
Pub Date: 2022-03-17 | DOI: 10.1109/ICSTW55395.2022.00017
Vaclav Rechtberger, Miroslav Bures, Bestoun S. Ahmed, Y. Belkhier, Jiří Néma, H. Schvach
Model-based Testing (MBT) is an effective approach for testing when parts of a system-under-test have the characteristics of a finite state machine (FSM). Despite the various strategies in the literature on this topic, little work handles special testing situations in which, concurrently, (1) the test paths can start and end only in defined states of the FSM, (2) a prioritization mechanism requires only defined states and transitions of the FSM to be visited by test cases, and (3) the test paths must lie within a given length range, not necessarily of explicit uniform length. This paper presents a test generation strategy that satisfies all these requirements. A concurrent combination of these requirements is highly practical for real industrial testing. Six variants of possible algorithms to implement this strategy are described. Using a mixture of 180 problem instances from real automotive and defense projects and artificially generated FSMs, all variants are compared with a baseline strategy based on a modification of the established N-switch coverage concept. Various properties of the generated test paths and their potential to activate fictional defects defined in the FSMs are evaluated. The presented strategy outperforms the baseline in most problem configurations. Of the six analyzed variants, three give the best results, even though a universal best performer is hard to identify. Depending on the application of the FSM, the strategy and evaluation presented in this paper are applicable to testing both functional and non-functional software requirements.
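A minimal sketch of the core path-generation constraints (allowed start/end states and a transition-count range over a directed FSM graph) is shown below. It is plain depth-first enumeration, not any of the six algorithm variants or the prioritization mechanism evaluated in the paper, and the example FSM is made up.

```python
# Hypothetical FSM as an adjacency list of directed transitions.
fsm = {
    "S0": ["S1", "S2"],
    "S1": ["S2", "S3"],
    "S2": ["S3"],
    "S3": ["S0"],
}
allowed_starts = {"S0"}
allowed_ends = {"S3"}
min_len, max_len = 2, 4          # length measured in transitions

def test_paths(fsm, starts, ends, min_len, max_len):
    """Enumerate paths that start/end in allowed states within the length range."""
    results = []

    def dfs(path):
        if len(path) - 1 > max_len:          # too many transitions, prune
            return
        if path[-1] in ends and min_len <= len(path) - 1 <= max_len:
            results.append(list(path))
        for nxt in fsm.get(path[-1], []):
            dfs(path + [nxt])

    for start in starts:
        dfs([start])
    return results

for p in test_paths(fsm, allowed_starts, allowed_ends, min_len, max_len):
    print(" -> ".join(p))
```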
{"title":"Prioritized Variable-length Test Cases Generation for Finite State Machines","authors":"Vaclav Rechtberger, Miroslav Bures, Bestoun S. Ahmed, Y. Belkhier, Jiří Néma, H. Schvach","doi":"10.1109/ICSTW55395.2022.00017","DOIUrl":"https://doi.org/10.1109/ICSTW55395.2022.00017","url":null,"abstract":"Model-based Testing (MBT) is an effective approach for testing when parts of a system-under-test have the characteristics of a finite state machine (FSM). Despite various strategies in the literature on this topic, little work exists to handle special testing situations. More specifically, when concurrently: (1) the test paths can start and end only in defined states of the FSM, (2) a prioritization mechanism that requires only defined states and transitions of the FSM to be visited by test cases is required, and (3) the test paths must be in a given length range, not necessarily of explicit uniform length. This paper presents a test generation strategy that satisfies all these requirements. A concurrent combination of these requirements is highly practical for real industrial testing. Six variants of possible algorithms to implement this strategy are described. Using a mixture of 180 problem instances from real automotive and defense projects and artificially generated FSMs, all variants are compared with a baseline strategy based on an established N-switch coverage concept modification. Various properties of the generated test paths and their potential to activate fictional defects defined in FSMs are evaluated. The presented strategy outperforms the baseline in most problem configurations. Out of the six analyzed variants, three give the best results even though a universal best performer is hard to identify. Depending on the application of the FSM, the strategy and evaluation presented in this paper are applicable both in testing functional and non-functional software requirements.","PeriodicalId":147133,"journal":{"name":"2022 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132883389","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Overview of Test Coverage Criteria for Test Case Generation from Finite State Machines Modelled as Directed Graphs"
Pub Date: 2022-03-17 | DOI: 10.1109/ICSTW55395.2022.00044
Vaclav Rechtberger, Miroslav Bures, Bestoun S. Ahmed
Test Coverage criteria are an essential concept for test engineers when generating test cases from a System Under Test model. They are routinely used in test case generation for user interfaces, middleware, and back-end system parts of software, electronics, or Internet of Things (IoT) systems. Test Coverage criteria define the number of actions or combinations by which a system is tested, informally determining the potential "strength" of a test set. As no previous study has summarized all commonly used test coverage criteria for Finite State Machines and comprehensively discussed them regarding their subsumption, equivalence, or non-comparability, this paper provides such an overview. In this study, the 14 most common test coverage criteria for Finite State Machines defined via a directed graph, together with seven of their synonyms, are summarized and compared. The results give researchers and industry testing engineers a helpful overview when setting a test strategy for a software-based or IoT system.
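As an illustration of two of the surveyed criteria, the sketch below checks whether a set of test paths achieves all-transitions (0-switch) coverage and measures transition-pair (1-switch) coverage on an FSM modelled as a directed graph. The FSM and test paths are hypothetical, and the check is a simplification of the definitions compared in the paper.

```python
# Hypothetical FSM transitions (directed edges) and test paths over its states.
fsm_edges = {("S0", "S1"), ("S1", "S2"), ("S2", "S0"), ("S1", "S0")}
test_paths = [["S0", "S1", "S2", "S0"], ["S0", "S1", "S0"]]

def transitions(path):
    """Consecutive edges traversed by a test path."""
    return list(zip(path, path[1:]))

def transition_pairs(path):
    """Pairs of adjacent transitions traversed by a test path."""
    t = transitions(path)
    return list(zip(t, t[1:]))

covered_edges = {e for p in test_paths for e in transitions(p)}
covered_pairs = {pp for p in test_paths for pp in transition_pairs(p)}

# Required 1-switch pairs: adjacent transitions sharing their middle state.
required_pairs = {((a, b), (b2, c))
                  for (a, b) in fsm_edges for (b2, c) in fsm_edges if b == b2}

print("0-switch (all transitions) covered:", fsm_edges <= covered_edges)
print("1-switch pairs covered: %d / %d"
      % (len(covered_pairs & required_pairs), len(required_pairs)))
```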
{"title":"Overview of Test Coverage Criteria for Test Case Generation from Finite State Machines Modelled as Directed Graphs","authors":"Vaclav Rechtberger, Miroslav Bures, Bestoun S. Ahmed","doi":"10.1109/ICSTW55395.2022.00044","DOIUrl":"https://doi.org/10.1109/ICSTW55395.2022.00044","url":null,"abstract":"Test Coverage criteria are an essential concept for test engineers when generating the test cases from a System Under Test model. They are routinely used in test case generation for user interfaces, middleware, and back-end system parts for software, electronics, or Internet of Things (IoT) systems. Test Coverage criteria define the number of actions or combinations by which a system is tested, informally determining a potential \"strength\" of a test set. As no previous study summarized all commonly used test coverage criteria for Finite State Machines and comprehensively discussed them regarding their subsumption, equivalence, or non-comparability, this paper provides this overview. In this study, 14 most common test coverage criteria and seven of their synonyms for Finite State Machines defined via a directed graph are summarized and compared. The results give researchers and industry testing engineers a helpful overview when setting a software-based or IoT system test strategy.","PeriodicalId":147133,"journal":{"name":"2022 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125525998","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"µBert: Mutation Testing using Pre-Trained Language Models"
Pub Date: 2022-03-07 | DOI: 10.1109/ICSTW55395.2022.00039
Renzo Degiovanni, Mike Papadakis
We introduce µBert, a mutation testing tool that uses a pre-trained language model (CodeBERT) to generate mutants. This is done by masking a token from the expression given as input and using CodeBERT to predict it. Thus, the mutants are generated by replacing the masked tokens with the predicted ones. We evaluate µBert on 40 real faults from Defects4J and show that it can detect 27 out of the 40 faults, while the baseline (PiTest) detects 26 of them. We also show that µBert can be two times more cost-effective than PiTest when the same number of mutants is analysed. Additionally, we evaluate the impact of µBert’s mutants when used by program assertion inference techniques, and show that they can help in producing better specifications. Finally, we discuss the quality and naturalness of some interesting mutants produced by µBert during our experimental evaluation.
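A minimal sketch of the mask-and-predict step is shown below, assuming the publicly available microsoft/codebert-base-mlm checkpoint via Hugging Face Transformers. This is a plausible setup, not µBert's actual pipeline, and the masked Java snippet is invented; each predicted token that differs from the original one would yield a mutant.

```python
# pip install transformers torch
from transformers import pipeline

# CodeBERT with a masked-language-model head; RoBERTa-style models use "<mask>".
fill_mask = pipeline("fill-mask", model="microsoft/codebert-base-mlm")

# Mask one token of a (hypothetical) Java expression and let the model propose replacements.
masked_code = "if (index <mask> list.size()) { return list.get(index); }"

for prediction in fill_mask(masked_code, top_k=5):
    # Candidate tokens that differ from the original operator become mutants.
    print(prediction["token_str"].strip(), round(prediction["score"], 3))
```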
{"title":"µBert: Mutation Testing using Pre-Trained Language Models","authors":"Renzo Degiovanni, Mike Papadakis","doi":"10.1109/ICSTW55395.2022.00039","DOIUrl":"https://doi.org/10.1109/ICSTW55395.2022.00039","url":null,"abstract":"We introduce µBert, a mutation testing tool that uses a pre-trained language model (CodeBERT) to generate mutants. This is done by masking a token from the expression given as input and using CodeBERT to predict it. Thus, the mutants are generated by replacing the masked tokens with the predicted ones. We evaluate µBert on 40 real faults from Defects4J and show that it can detect 27 out of the 40 faults, while the baseline (PiTest) detects 26 of them. We also show that µBert can be 2 times more cost-effective than PiTest, when the same number of mutants are analysed. Additionally, we evaluate the impact of µBert’s mutants when used by program assertion inference techniques, and show that they can help in producing better specifications. Finally, we discuss about the quality and naturalness of some interesting mutants produced by µBert during our experimental evaluation.","PeriodicalId":147133,"journal":{"name":"2022 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131216324","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Testing Deep Learning Models: A First Comparative Study of Multiple Testing Techniques"
Pub Date: 2022-02-24 | DOI: 10.1109/ICSTW55395.2022.00035
M. K. Ahuja, A. Gotlieb, Helge Spieker
Deep Learning (DL) has revolutionized the capabilities of vision-based systems (VBS) in critical applications such as autonomous driving, robotic surgery, critical infrastructure surveillance, and air and maritime traffic control. By analyzing images, voice, videos, or any type of complex signal, DL has considerably increased the situation awareness of these systems. At the same time, as VBS rely more and more on trained DL models, their reliability and robustness have been challenged, and it has become crucial to test these models thoroughly to assess their capabilities and potential errors. To discover faults in DL models, existing software testing methods have been adapted and refined accordingly. In this article, we provide an overview of these software testing methods, namely differential, metamorphic, mutation, and combinatorial testing, as well as adversarial perturbation testing, and review some challenges in their deployment for boosting perception systems used in VBS. We also provide a first experimental comparative study on a classical benchmark used in VBS and discuss its results.
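As a small illustration of one of the surveyed techniques, metamorphic testing, the sketch below checks that a classifier's prediction does not change under a mild brightness shift of the input image. The model here is a stand-in stub and the relation, threshold, and data are illustrative assumptions, not the benchmark models or relations used in the study.

```python
import numpy as np

def model_predict(image):
    """Stand-in for a trained image classifier: returns a class label."""
    # A real test would call the deep model under test here.
    return int(image.mean() > 0.4)

def metamorphic_brightness_test(image, delta=0.05):
    """Metamorphic relation: a small brightness change should not flip the prediction."""
    original = model_predict(image)
    follow_up = model_predict(np.clip(image + delta, 0.0, 1.0))
    return original == follow_up

rng = np.random.default_rng(0)
images = rng.random((10, 32, 32, 3))      # hypothetical normalized RGB inputs
failures = [i for i, img in enumerate(images) if not metamorphic_brightness_test(img)]
print("Metamorphic violations:", failures)
```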
{"title":"Testing Deep Learning Models: A First Comparative Study of Multiple Testing Techniques","authors":"M. K. Ahuja, A. Gotlieb, Helge Spieker","doi":"10.1109/ICSTW55395.2022.00035","DOIUrl":"https://doi.org/10.1109/ICSTW55395.2022.00035","url":null,"abstract":"Deep Learning (DL) has revolutionized the capabilities of vision-based systems (VBS) in critical applications such as autonomous driving, robotic surgery, critical infrastructure surveillance, air and maritime traffic control, etc. By analyzing images, voice, videos, or any type of complex signals, DL has considerably increased the situation awareness of these systems. At the same time, while relying more and more on trained DL models, the reliability and robustness of VBS have been challenged and it has become crucial to test thoroughly these models to assess their capabilities and potential errors. To discover faults in DL models, existing software testing methods have been adapted and refined accordingly. In this article, we provide an overview of these software testing methods, namely differential, metamorphic, mutation, and combinatorial testing, as well as adversarial perturbation testing and review some challenges in their deployment for boosting perception systems used in VBS. We also provide a first experimental comparative study on a classical benchmark used in VBS and discuss its results.","PeriodicalId":147133,"journal":{"name":"2022 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133946264","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Pinpointing Anomaly Events in Logs from Stability Testing – N-Grams vs. Deep-Learning"
Pub Date: 2022-02-18 | DOI: 10.1109/ICSTW55395.2022.00056
M. Mäntylä, M. Varela, Shayan Hashemi
As stability testing execution logs can be very long, software engineers need help in locating anomalous events. We develop and evaluate two models for scoring individual log events for anomalousness, namely an N-Gram model and a Deep Learning model with LSTM (Long Short-Term Memory). Both are trained on normal log sequences only. We evaluate the models with long log sequences from Android stability testing in our company case and with short log sequences from the public HDFS (Hadoop Distributed File System) dataset. We evaluate next-event prediction accuracy and computational efficiency. The LSTM model is more accurate on the stability testing logs (0.848 vs 0.865), whereas on the HDFS logs the N-Gram model is slightly more accurate (0.904 vs 0.900). The N-Gram model has far superior computational efficiency compared to the deep model (4 to 13 seconds vs 16 minutes to nearly 4 hours), making it the preferred choice for our case company. Scoring individual log events for anomalousness seems like a good aid for root cause analysis of failing test cases, and our case company plans to add it to its online services. Despite the recent surge in using deep learning for software system anomaly detection, we found limited benefits in doing so. However, future work should consider whether our finding holds with different LSTM-model hyper-parameters, other datasets, and other deep-learning approaches that promise better accuracy and computational efficiency than LSTM-based models.
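A minimal sketch of the N-gram side of the comparison: train on normal log-event sequences only, then score each event by the estimated probability of seeing it after the preceding n-1 events, so that low-probability events are flagged as anomalous. The window size, add-alpha smoothing, and example event sequences are illustrative assumptions, not the paper's exact model.

```python
from collections import Counter, defaultdict

N = 3  # trigram model over log-event identifiers

def train(sequences):
    """Count (context -> next event) occurrences on normal sequences only."""
    counts = defaultdict(Counter)
    for seq in sequences:
        padded = ["<s>"] * (N - 1) + seq
        for i in range(N - 1, len(padded)):
            context = tuple(padded[i - N + 1:i])
            counts[context][padded[i]] += 1
    return counts

def score_events(counts, seq, vocab_size, alpha=1.0):
    """Per-event probability with add-alpha smoothing; low values flag anomalies."""
    padded = ["<s>"] * (N - 1) + seq
    scores = []
    for i in range(N - 1, len(padded)):
        context = tuple(padded[i - N + 1:i])
        ctx_counts = counts[context]
        prob = (ctx_counts[padded[i]] + alpha) / (sum(ctx_counts.values()) + alpha * vocab_size)
        scores.append((padded[i], prob))
    return scores

normal_sequences = [["boot", "connect", "send", "ack", "close"]] * 50
counts = train(normal_sequences)
test_sequence = ["boot", "connect", "send", "crash", "close"]
for event, prob in score_events(counts, test_sequence, vocab_size=6):
    print(f"{event}: {prob:.3f}")
```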
{"title":"Pinpointing Anomaly Events in Logs from Stability Testing – N-Grams vs. Deep-Learning","authors":"M. Mäntylä, M. Varela, Shayan Hashemi","doi":"10.1109/ICSTW55395.2022.00056","DOIUrl":"https://doi.org/10.1109/ICSTW55395.2022.00056","url":null,"abstract":"As stability testing execution logs can be very long, software engineers need help in locating anomalous events. We develop and evaluate two models for scoring individual log-events for anomalousness, namely an N-Gram model and a Deep Learning model with LSTM (Long short-term memory). Both are trained on normal log sequences only. We evaluate the models with long log sequences of Android stability testing in our company case and with short log sequences from HDFS (Hadoop Distributed File System) public dataset. We evaluate next event prediction accuracy and computational efficiency. The LSTM model is more accurate in stability testing logs (0.848 vs 0.865), whereas in HDFS logs the N-Gram is slightly more accurate (0.904 vs 0.900). The N-Gram model has far superior computational efficiency compared to the Deep model (4 to 13 seconds vs 16 minutes to nearly 4 hours), making it the preferred choice for our case company. Scoring individual log events for anomalousness seems like a good aid for root cause analysis of failing test cases, and our case company plans to add it to its online services. Despite the recent surge in using deep learning in software system anomaly detection, we found limited benefits in doing so. However, future work should consider whether our finding holds with different LSTM-model hyper-parameters, other datasets, and with other deep-learning approaches that promise better accuracy and computational efficiency than LSTM based models.","PeriodicalId":147133,"journal":{"name":"2022 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129603390","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Systematic Training and Testing for Machine Learning Using Combinatorial Interaction Testing"
Pub Date: 2022-01-28 | DOI: 10.1109/ICSTW55395.2022.00031
Tyler Cody, Erin Lanus, Daniel D. Doyle, Laura J. Freeman
This paper demonstrates the systematic use of combinatorial coverage for selecting and characterizing test and training sets for machine learning models. The presented work adapts combinatorial interaction testing, which has been successfully leveraged in identifying faults in software testing, to characterize data used in machine learning. The MNIST handwritten-digits data is used to demonstrate that combinatorial coverage can be used to select test sets that stress machine learning model performance, to select training sets that lead to robust model performance, and to select data for fine-tuning models to new domains. Thus, the results posit combinatorial coverage as a holistic approach to training and testing for machine learning. In contrast to prior work, which has focused on coverage of the internals of neural networks, this paper considers coverage over simple features derived from inputs and outputs. Thus, this paper addresses the case where the supplier of test and training sets for machine learning models does not have intellectual property rights to the models themselves. Finally, the paper addresses prior criticism of combinatorial coverage and provides a rebuttal that advocates the use of coverage metrics in machine learning applications.
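To illustrate the underlying measure, the sketch below computes t-way combinatorial coverage of a dataset over a few discretized features: the fraction of all possible t-way feature-value combinations that actually appear in the data. The features, value domains, and t=2 setting are hypothetical, not the paper's experimental configuration.

```python
from itertools import combinations, product

# Hypothetical discretized features derived from inputs/outputs.
domains = {
    "brightness": ["low", "mid", "high"],
    "thickness": ["thin", "thick"],
    "label_parity": ["even", "odd"],
}
dataset = [
    {"brightness": "low", "thickness": "thin", "label_parity": "even"},
    {"brightness": "mid", "thickness": "thick", "label_parity": "odd"},
    {"brightness": "high", "thickness": "thin", "label_parity": "odd"},
]

def t_way_coverage(dataset, domains, t=2):
    """Fraction of possible t-way feature-value combinations present in the data."""
    covered = total = 0
    for feature_subset in combinations(sorted(domains), t):
        possible = set(product(*(domains[f] for f in feature_subset)))
        observed = {tuple(row[f] for f in feature_subset) for row in dataset}
        covered += len(possible & observed)
        total += len(possible)
    return covered / total

print(f"2-way combinatorial coverage: {t_way_coverage(dataset, domains):.2f}")
```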
{"title":"Systematic Training and Testing for Machine Learning Using Combinatorial Interaction Testing","authors":"Tyler Cody, Erin Lanus, Daniel D. Doyle, Laura J. Freeman","doi":"10.1109/ICSTW55395.2022.00031","DOIUrl":"https://doi.org/10.1109/ICSTW55395.2022.00031","url":null,"abstract":"This paper demonstrates the systematic use of combinatorial coverage for selecting and characterizing test and training sets for machine learning models. The presented work adapts combinatorial interaction testing, which has been successfully leveraged in identifying faults in software testing, to characterize data used in machine learning. The MNIST hand-written digits data is used to demonstrate that combinatorial coverage can be used to select test sets that stress machine learning model performance, to select training sets that lead to robust model performance, and to select data for fine-tuning models to new domains. Thus, the results posit combinatorial coverage as a holistic approach to training and testing for machine learning. In contrast to prior work which has focused on the use of coverage in regard to the internal of neural networks, this paper considers coverage over simple features derived from inputs and outputs. Thus, this paper addresses the case where the supplier of test and training sets for machine learning models does not have intellectual property rights to the models themselves. Finally, the paper addresses prior criticism of combinatorial coverage and provides a rebuttal which advocates the use of coverage metrics in machine learning applications.","PeriodicalId":147133,"journal":{"name":"2022 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW)","volume":"46 9","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114105167","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Early Detection of Network Attacks Using Deep Learning"
Pub Date: 2022-01-27 | DOI: 10.1109/ICSTW55395.2022.00020
Tanwir Ahmad, D. Truscan, Juri Vain, Ivan Porres
The Internet has become a prime target of security attacks and intrusions. These attacks can lead to system malfunction, network breakdown, data corruption, or theft. A network intrusion detection system (IDS) is a tool used for identifying unauthorized and malicious behavior by observing the network traffic. State-of-the-art intrusion detection systems are designed to detect an attack by inspecting complete information about the attack. This means that an IDS can only detect an attack after it has been executed on the system under attack and might already have caused damage. In this paper, we propose an end-to-end early intrusion detection system to prevent network attacks before they can cause further damage to the system under attack, thereby preventing unforeseen downtime and interruption. We employ a deep neural network-based classifier for attack identification. The network is trained in a supervised manner to extract relevant features from raw network traffic data instead of relying on the manual feature selection process used in most related approaches. Further, we introduce a new metric, called earliness, to evaluate how early our proposed approach detects attacks. We have empirically evaluated our approach on the CICIDS2017 dataset. The results show that our approach performed well and attained an overall balanced accuracy of 0.803.
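A minimal sketch of the general idea of supervised classification over raw traffic prefixes is given below: a small 1D-CNN that labels a flow as benign or attack from only its first bytes. The architecture, prefix length, and random stand-in data are assumptions for illustration, not the authors' network or the CICIDS2017 preprocessing.

```python
# pip install torch
import torch
import torch.nn as nn

class EarlyFlowClassifier(nn.Module):
    """Classifies a traffic flow from only its first `prefix_len` bytes."""
    def __init__(self, prefix_len=200, n_classes=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(32, n_classes),
        )

    def forward(self, x):            # x: (batch, prefix_len) byte values in [0, 255]
        return self.net(x.unsqueeze(1) / 255.0)

# Hypothetical training step on random stand-in data (real inputs would be
# byte prefixes of network flows labelled benign/attack).
model = EarlyFlowClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

flows = torch.randint(0, 256, (64, 200)).float()
labels = torch.randint(0, 2, (64,))
loss = criterion(model(flows), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print("example training loss:", loss.item())
```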
{"title":"Early Detection of Network Attacks Using Deep Learning","authors":"Tanwir Ahmad, D. Truscan, Juri Vain, Ivan Porres","doi":"10.1109/ICSTW55395.2022.00020","DOIUrl":"https://doi.org/10.1109/ICSTW55395.2022.00020","url":null,"abstract":"The Internet has become a prime subject to security attacks and intrusions by attackers. These attacks can lead to system malfunction, network breakdown, data corruption or theft. A network intrusion detection system (IDS) is a tool used for identifying unauthorized and malicious behavior by observing the network traffic. State-of-the-art intrusion detection systems are designed to detect an attack by inspecting the complete information about the attack. This means that an IDS would only be able to detect an attack after it has been executed on the system under attack and might have caused damage to the system. In this paper, we propose an end-to-end early intrusion detection system to prevent network attacks before they could cause any more damage to the system under attack while preventing unforeseen downtime and interruption. We employ a deep neural network-based classifier for attack identification. The network is trained in a supervised manner to extract relevant features from raw network traffic data instead of relying on a manual feature selection process used in most related approaches. Further, we introduce a new metric, called earliness, to evaluate how early our proposed approach detects attacks. We have empirically evaluated our approach on the CICIDS2017 dataset. The results show that our approach performed well and attained an overall 0.803 balanced accuracy.","PeriodicalId":147133,"journal":{"name":"2022 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW)","volume":"376 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129162421","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}