Latest Publications from the 2019 12th IEEE Conference on Software Testing, Validation and Verification (ICST)

Automated Testing of Basic Recognition Capability for Speech Recognition Systems
Futoshi Iwama, Takashi Fukuda
Automatic speech recognition systems transform speech audio data into text data, i.e., word sequences, as the recognition results. These word sequences are generally defined by the language model of the speech recognition system. Therefore, the ability of a speech recognition system to translate audio obtained by typically pronouncing word sequences accepted by its language model into word sequences equivalent to the original ones can be regarded as a basic capability of such systems. This work describes a testing method that checks whether speech recognition systems have this basic recognition capability. The method verifies the basic capability by performing the testing separately from recognition robustness testing, and it can be fully automated. We constructed a test automation system and evaluated through several experiments whether it could detect defects in speech recognition systems. The results demonstrate that the test automation system can effectively detect basic defects at an early phase of speech recognition development or refinement.
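As an illustration of the testing idea described in this abstract, the following minimal Python sketch (not the authors' tool) checks a basic recognition capability of an ASR system; `synthesize_speech` and `recognize` are hypothetical hooks for a TTS engine and the recognizer under test, and the word-equivalence check is a simple normalization rather than the paper's exact criterion.

```python
import re

def normalize(text):
    """Lower-case and strip punctuation so equivalent word sequences compare equal."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).split()

def test_basic_recognition(language_model_sentences, synthesize_speech, recognize):
    """For each sentence accepted by the language model, synthesize typical audio,
    run it through the recognizer, and check that the result is equivalent to the input.
    `synthesize_speech` and `recognize` are hypothetical hooks supplied by the caller."""
    failures = []
    for sentence in language_model_sentences:
        audio = synthesize_speech(sentence)   # e.g. a TTS engine producing a typical pronunciation
        hypothesis = recognize(audio)         # the speech recognition system under test
        if normalize(hypothesis) != normalize(sentence):
            failures.append((sentence, hypothesis))
    return failures

# Usage sketch: failures = test_basic_recognition(["turn on the light"], my_tts, my_asr)
```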
DOI: 10.1109/ICST.2019.00012 (published 2019-04-01)
Citations: 2
Program Repair at Arbitrary Fault Depth
Besma Khaireddine, Matias Martinez, A. Mili
Program repair has been an active research area for over a decade and has made great strides in terms of scalable automated repair tools. In this paper we argue that existing program repair tools lack an important ingredient, which limits their scope and their efficiency: a formal definition of a fault, and a formal characterization of fault removal. To support our conjecture, we consider GenProg, an archetypical program repair tool, and modify it according to our definitions of fault and fault removal; then we show, by means of empirical experiments, the impact that this has on the effectiveness and efficiency of the tool.
DOI: 10.1109/ICST.2019.00056 (published 2019-04-01)
Citations: 8
VCIPR: Vulnerable Code is Identifiable When a Patch is Released (Hacker's Perspective)
Junaid Akram, Liang Qi, Ping Luo
Vulnerable source code fragments remain unfixed for many years and often propagate to other systems. Unfortunately, this happens frequently when patch files are not propagated to all vulnerable code clones. An unpatched bug is a critical security problem that should be detected and repaired as early as possible. In this paper, we present VCIPR, a scalable system for vulnerability detection in unpatched source code. It uses a fast, token-based approach to detect vulnerabilities at function-level granularity. The approach is language independent and supports multiple programming languages, including Java, C/C++, and JavaScript. VCIPR detects the most common repair patterns in patch files for vulnerable code evaluation. We build a fingerprint index of the source code of top critical CVEs, retrieved from a reliable source, and then detect unpatched (vulnerable/non-vulnerable) code fragments in common open-source software with high accuracy. A comparison with state-of-the-art tools demonstrates the effectiveness, efficiency, and scalability of our approach. Furthermore, this paper shows how hackers can easily identify vulnerable software whenever a patch file is released.
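The following rough Python sketch (not the VCIPR implementation) illustrates the general idea of token-based, function-level fingerprint matching against an index of known-vulnerable CVE code; the tokenizer, n-gram size, and similarity threshold are assumptions for illustration only.

```python
import hashlib
import re

def tokenize(source):
    """Very rough language-agnostic tokenizer: identifiers, numbers, and operators."""
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", source)

def fingerprint(source, n=4):
    """Hash every n-gram of tokens; the set of hashes is the function's fingerprint."""
    tokens = tokenize(source)
    grams = [" ".join(tokens[i:i + n]) for i in range(max(len(tokens) - n + 1, 1))]
    return {hashlib.md5(g.encode()).hexdigest() for g in grams}

def looks_vulnerable(candidate_fn, cve_index, threshold=0.8):
    """Report CVE entries whose fingerprint overlaps the candidate function strongly.
    `cve_index` is a hypothetical dict mapping a CVE id to the fingerprint of its
    known-vulnerable function."""
    fp = fingerprint(candidate_fn)
    hits = []
    for cve_id, vuln_fp in cve_index.items():
        overlap = len(fp & vuln_fp) / max(len(vuln_fp), 1)
        if overlap >= threshold:
            hits.append((cve_id, overlap))
    return hits
```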
DOI: 10.1109/ICST.2019.00049 (published 2019-04-01)
Citations: 9
Coverage-Driven Test Generation for Thread-Safe Classes via Parallel and Conflict Dependencies
Valerio Terragni, M. Pezzè, F. A. Bianchi
Thread-safe classes are common in concurrent object-oriented programs. Testing such classes is important to ensure the reliability of the concurrent programs that rely on them. Recently, researchers have proposed the automated generation of concurrent (multi-threaded) tests to expose concurrency faults in thread-safe classes (thread-safety violations). However, generating fault-revealing concurrent tests within an affordable time-budget is difficult due to the huge search space of possible concurrent tests. In this paper, we present DepCon, an approach to effectively reduce the search space of concurrent tests by means of both parallel and conflict dependency analyses. DepCon is based on the intuition that only methods that can both interleave (parallel dependent) and access the same shared memory locations (conflict dependent) can lead to thread-safety violations when concurrently executed. DepCon implements an efficient static analysis to compute the parallel and conflict dependencies among the methods of a class and uses the computed dependencies to steer the generation of tests towards concurrent tests that exhibit the computed dependencies. We evaluated DepCon by experimenting with a prototype implementation for Java programs on a set of thread-safe classes with known concurrency faults. The experimental results show that DepCon is more effective in exposing concurrency faults than state-of-the-art techniques.
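The conflict-dependency intuition can be illustrated with a small Python sketch (this is not DepCon's static analysis, and it ignores the parallel-dependency side); the per-method read/write summaries below are hypothetical.

```python
from itertools import combinations

def conflict_dependent(access_a, access_b):
    """Two methods are conflict dependent if they touch a common field
    and at least one of the two accesses is a write."""
    for field in set(access_a) & set(access_b):
        if "write" in (access_a[field], access_b[field]):
            return True
    return False

# Hypothetical read/write summaries for the methods of a class under test.
accesses = {
    "add":      {"size": "write", "elements": "write"},
    "contains": {"size": "read",  "elements": "read"},
    "clear":    {"size": "write", "elements": "write"},
}

# Only conflict-dependent pairs are worth exercising in concurrent tests.
candidate_pairs = [
    (m1, m2) for m1, m2 in combinations(accesses, 2)
    if conflict_dependent(accesses[m1], accesses[m2])
]
print(candidate_pairs)  # [('add', 'contains'), ('add', 'clear'), ('contains', 'clear')]
```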
DOI: 10.1109/ICST.2019.00034 (published 2019-04-01)
Citations: 5
Uniform Sampling of SAT Solutions for Configurable Systems: Are We There Yet?
Quentin Plazar, M. Acher, Gilles Perrouin, Xavier Devroey, Maxime Cordy
Uniform or near-uniform generation of solutions for large satisfiability formulas is a problem of theoretical and practical interest for the testing community. Recent works proposed two algorithms (namely UniGen and QuickSampler) for reaching a good compromise between execution time and uniformity guarantees, with empirical evidence on SAT benchmarks. In the context of highly-configurable software systems (e.g., Linux), it is unclear whether UniGen and QuickSampler can scale and sample uniform software configurations. In this paper, we perform a thorough experiment on 128 real-world feature models. We find that UniGen is unable to produce SAT solutions out of such feature models. Furthermore, we show that QuickSampler does not generate uniform samples and that some features are either never part of the sample or too frequently present. Finally, using a case study, we characterize the impacts of these results on the ability to find bugs in a configurable system. Overall, our results suggest that we are not there: more research is needed to explore the cost-effectiveness of uniform sampling when testing large configurable systems.
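To make the notion of sampler uniformity concrete, here is a small Python sketch (not the paper's experimental setup) that compares observed sample frequencies against a brute-force enumeration of all solutions of a tiny CNF formula; the formula and the "sample" are illustrative only.

```python
from itertools import product
from collections import Counter

def solutions(clauses, num_vars):
    """Brute-force all satisfying assignments of a tiny CNF formula.
    Literals are 1-based ints, negative for negation."""
    sols = []
    for bits in product([False, True], repeat=num_vars):
        if all(any(bits[abs(l) - 1] == (l > 0) for l in clause) for clause in clauses):
            sols.append(bits)
    return sols

def uniformity_report(sample, clauses, num_vars):
    """Compare how often each solution appears in `sample` with the uniform expectation."""
    sols = solutions(clauses, num_vars)
    expected = len(sample) / len(sols)
    return {s: Counter(sample).get(s, 0) / expected for s in sols}  # 1.0 everywhere == uniform

clauses = [[1, 2], [-1, 3]]  # (x1 or x2) and (not x1 or x3): 4 solutions over 3 variables
fake_sample = solutions(clauses, 3) * 10  # a perfectly uniform "sample" for demonstration
print(uniformity_report(fake_sample, clauses, 3))  # all ratios are 1.0
```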
DOI: 10.1109/ICST.2019.00032 (published 2019-04-01)
Citations: 56
Directing a Search Towards Execution Properties with a Learned Fitness Function
Leonid Joffe, D. Clark
Search-based software testing is a popular and successful approach in both academia and industry. SBST methods typically aim to increase coverage, whereas searching for executions with specific properties is largely unexplored. Fitness functions for execution properties often possess search landscapes that are difficult or intractable. We demonstrate how machine learning techniques can convert a property that is not searchable, in this case crashes, into one that is. Through experimentation on 6000 C programs drawn from the Codeflaws repository, we demonstrate a strong, program-independent correlation between crashing executions and the library function call patterns within those executions, as discovered by a neural net. We then exploit this correlation to produce a searchable fitness landscape and use it to modify American Fuzzy Lop, a widely used fuzz testing tool. On a test set of previously unseen programs drawn from Codeflaws, a search strategy based on a crash-targeting fitness function outperformed a baseline in 80.1% of cases. The experiments were then repeated on three real-world programs: the VLC media player, and the libjpeg and mpg321 libraries. The correlation between library call traces and crashes generalises, as indicated by ROC AUC scores of 0.91, 0.88 and 0.61. The resulting search landscape, however, is hampered by plateaus, likely because these programs do not use standard C libraries as often as those in Codeflaws. This limitation can be overcome by considering a more powerful observation domain and a broader training corpus in future work. Despite the limited generalisability of the experimental setup, this research opens new possibilities at the intersection of machine learning, fitness functions, and search-based testing in general.
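The core idea of a learned, crash-targeting fitness function can be sketched as follows; note that this toy Python example substitutes a bag-of-calls logistic regression for the paper's neural net, and the call traces and labels are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: library call traces of past executions and whether they crashed.
traces = [
    "malloc strcpy free",   # crashed
    "malloc memcpy free",   # crashed
    "fopen fread fclose",   # did not crash
    "printf puts",          # did not crash
]
crashed = [1, 1, 0, 0]

vectorizer = CountVectorizer()                 # bag-of-calls features
X = vectorizer.fit_transform(traces)
model = LogisticRegression().fit(X, crashed)

def fitness(trace):
    """Learned fitness: the predicted probability that an execution with this
    library-call trace crashes; a fuzzer can maximise this instead of raw coverage."""
    return model.predict_proba(vectorizer.transform([trace]))[0, 1]

print(fitness("malloc strcpy strcpy free"))  # closer to 1.0 than a benign trace
```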
DOI: 10.1109/ICST.2019.00029 (published 2019-04-01)
Citations: 6
An Extensive Study on Cross-Project Predictive Mutation Testing
Dongyu Mao, Lingchao Chen, Lingming Zhang
Mutation testing is a powerful technique for evaluating the quality of test suites, which play a key role in ensuring software quality. The concept of mutation testing has also been widely used in other software engineering studies, e.g., test generation, fault localization, and program repair. During the process of mutation testing, a large number of mutants may be generated and then executed against the test suite to examine whether they can be killed, making the process extremely computationally expensive. Several techniques have been proposed to speed up this process, including selective, weakened, and predictive mutation testing. Among those techniques, Predictive Mutation Testing (PMT) builds a classification model from a large set of mutant execution records to predict whether new mutants would be killed or remain alive without executing them, and can achieve significant reductions in mutation cost. In PMT, each mutant is represented as a list of features related to the mutant itself and the test suite, transforming the mutation testing problem into a binary classification problem. In this paper, we perform an extensive study on the effectiveness and efficiency of the promising PMT technique under the cross-project setting, using a total of 654 real-world projects with more than 4 million mutants. Our work also complements the original PMT work by considering more features and powerful deep learning models. The experimental results show an average prediction accuracy of over 0.85 on the 654 projects using cross-validation, demonstrating the effectiveness of PMT. Meanwhile, a clear speedup is also observed, averaging 28.7X compared to traditional mutation testing with 5 threads. In addition, we analyze the importance of different groups of features in the classification model, which provides important implications for future research.
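A minimal Python sketch of the classification step behind predictive mutation testing is shown below; the feature columns, training values, and classifier choice are illustrative assumptions, not the study's actual feature set or models.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Each row describes one mutant with features of the mutant and of the test suite, e.g.
# [mutated line is covered, #tests covering it, #assertions in those tests, mutation operator id].
# Labels: 1 = killed, 0 = survived. All values here are made up for illustration.
X_train = np.array([
    [1, 12, 30, 2],
    [1,  3,  4, 0],
    [0,  0,  0, 1],
    [1,  8, 15, 2],
    [0,  1,  1, 0],
])
y_train = np.array([1, 1, 0, 1, 0])

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Cross-project PMT: a model trained on mutants from some projects predicts the outcomes
# of mutants from an unseen project without ever executing them.
X_new_project = np.array([[1, 10, 20, 2], [0, 0, 0, 1]])
print(clf.predict(X_new_project))   # e.g. [1 0]
print(clf.feature_importances_)     # which feature groups matter most
```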
DOI: 10.1109/ICST.2019.00025 (published 2019-04-01)
Citations: 31
Do Pseudo Test Suites Lead to Inflated Correlation in Measuring Test Effectiveness?
Jie M. Zhang, Lingming Zhang, Dan Hao, Meng Wang, Lu Zhang
Code coverage is the most widely adopted criterion for measuring test effectiveness in software quality assurance. The performance of coverage criteria (in indicating test suites' effectiveness) has been widely studied in prior work. Most of the studies use randomly constructed pseudo test suites to facilitate data collection for correlation analysis, yet no previous work has systematically studied whether pseudo test suites lead to inflated correlation results. This paper focuses on this potentially widespread threat with a study of 123 real-world Java projects. Following the typical experimental process of studying coverage criteria, we investigate the correlation between statement/assertion coverage and mutation score using both pseudo and original test suites. In addition to direct correlation analysis, we control for the number of assertions and the test suite size to conduct partial correlation analysis. The results reveal that 1) the correlation (between coverage criteria and mutation score) derived from pseudo test suites is much higher than that from original test suites (from 0.21 to 0.39 higher in Kendall value); 2) contrary to what was previously reported, statement coverage has a stronger correlation with mutation score than assertion coverage.
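The correlation measurement at the heart of this study can be illustrated with a short Python snippet using Kendall's tau; the coverage and mutation-score values below are placeholders, not data from the paper.

```python
from scipy.stats import kendalltau

# Hypothetical measurements for a set of test suites of one project:
# statement coverage and mutation score of each suite (original or pseudo).
statement_coverage = [0.42, 0.55, 0.61, 0.70, 0.78, 0.83]
mutation_score     = [0.30, 0.35, 0.48, 0.52, 0.66, 0.71]

tau, p_value = kendalltau(statement_coverage, mutation_score)
print(f"Kendall tau = {tau:.2f} (p = {p_value:.3f})")
# The study repeats this per project, for pseudo vs. original suites, and additionally
# controls for suite size and assertion count via partial correlation analysis.
```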
DOI: 10.1109/ICST.2019.00033 (published 2019-04-01)
Citations: 7
Extension-Aware Automated Testing Based on Imperative Predicates
Nima Dini, Cagdas Yelen, Miloš Gligorić, S. Khurshid
Bounded exhaustive testing (BET) techniques have been shown to be effective for detecting faults in software. BET techniques based on imperative predicates enumerate all test inputs up to given bounds such that each test input satisfies the properties encoded by the predicate. The search space is bounded by the user, who specifies the number of objects of each type and the list of values for each field of each type. To optimize the search, existing techniques detect isomorphic instances and record accessed fields during the execution of a predicate. However, these optimizations are extension-unaware, i.e., they do not speed up the search when the predicate is modified, say due to a fix or additional properties. We present a technique, named iGen, that speeds up test generation when imperative predicates are extended. iGen memoizes intermediate results of a test generation run and reuses them in future searches, even when the new search space differs from the old one. We integrated our technique into two BET tools (one for Java and one for Python) and evaluated these implementations with several data structure pairs, including two pairs from the Standard Java Library. Our results show that iGen speeds up test generation by up to 46.59x for the Java tool and up to 49.47x for the Python tool. Additionally, we show that the speedup obtained by iGen increases for larger test instances.
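For readers unfamiliar with bounded exhaustive generation driven by an imperative predicate, a compact Python illustration follows; it shows the general BET idea only (the rep_ok invariant and domain are toy assumptions) and not iGen's field-access memoization itself.

```python
from itertools import product

def rep_ok(values):
    """Imperative predicate: a candidate is valid iff it is a strictly sorted
    sequence with no duplicate elements (a toy data-structure invariant)."""
    return all(a < b for a, b in zip(values, values[1:]))

def bounded_exhaustive(domain, max_size):
    """Enumerate every tuple up to `max_size` over `domain` that satisfies rep_ok.
    iGen additionally memoizes which fields the predicate touched for each candidate,
    so that a later run with an extended predicate can reuse earlier results."""
    for size in range(max_size + 1):
        for candidate in product(domain, repeat=size):
            if rep_ok(candidate):
                yield list(candidate)

print(list(bounded_exhaustive(domain=[0, 1, 2], max_size=2)))
# [[], [0], [1], [2], [0, 1], [0, 2], [1, 2]]
```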
DOI: 10.1109/ICST.2019.00013 (published 2019-04-01)
Citations: 4
A Model-Based Approach to Generate Dynamic Synthetic Test Data
Chao Tan
Having access to high-quality test data is an important requirement to ensure effective cross-organizational integration testing. The common practice for addressing this need is to generate synthetic data. However, existing approaches cannot generate representative datasets that can evolve to allow the simulation of the dynamics of the systems under test. In this PhD project, and in collaboration with an industrial partner, we investigate the use of machine learning techniques for developing novel solutions that can generate synthetic, dynamic and representative test data.
DOI: 10.1109/ICST.2019.00063 (published 2019-04-01)
Citations: 2