Can testedness be effectively measured?

Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, Pub Date: 2016-11-01, DOI: 10.1145/2950290.2950324

Iftekhar Ahmed, Rahul Gopinath, Caius Brindescu, Alex Groce, Carlos Jensen


Citations: 37

Abstract

Among the major questions that a practicing tester faces are deciding where to focus additional testing effort, and deciding when to stop testing. A reasonable answer is to test the least-tested code and to stop when all code is well tested. Many measures of "testedness" have been proposed; unfortunately, we do not know whether these are truly effective. In this paper we propose a novel evaluation of two of the most important and widely used measures of test suite quality. The first measure is statement coverage, the simplest and best-known code coverage measure. The second measure is mutation score, a supposedly more powerful, though expensive, measure. We evaluate these measures using the actual criterion of interest: if a program element is (by these measures) well tested at a given point in time, it should require fewer future bug-fixes than a "poorly tested" element. If not, then it seems likely that we are not effectively measuring testedness. Using a large number of open source Java programs from GitHub and Apache, we show that both statement coverage and mutation score have only a weak negative correlation with bug-fixes. Despite the lack of strong correlation, there are statistically and practically significant differences between program elements for various binary criteria. Program elements (other than classes) covered by any test case see about half as many bug-fixes as those not covered, and a similar line can be drawn for mutation score thresholds. Our results have important implications for both software engineering practice and research evaluation.
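The two kinds of comparison the abstract describes (a correlation between a testedness measure and later bug-fixes, and a binary covered/not-covered split) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not say which correlation statistic the authors used, so Spearman rank correlation is a stand-in; the function names are hypothetical; and the numbers are invented toy data, not results from the paper.

```python
from statistics import mean


def rank(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of equal values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(rank(xs), rank(ys))


def binary_split(measure, bug_fixes, threshold=0.0):
    """Mean bug-fix count for elements above vs. at-or-below a threshold."""
    above = [b for m, b in zip(measure, bug_fixes) if m > threshold]
    below = [b for m, b in zip(measure, bug_fixes) if m <= threshold]
    return mean(above), mean(below)


# Invented toy data: one entry per program element (e.g. a method).
coverage = [0.0, 0.0, 0.5, 0.9, 1.0, 0.0, 0.7, 0.3]   # statement coverage
bug_fixes = [4, 2, 1, 2, 1, 6, 3, 1]                   # later bug-fix count

rho = spearman(coverage, bug_fixes)
covered_mean, uncovered_mean = binary_split(coverage, bug_fixes)
print(f"rank correlation: {rho:.2f}")
print(f"mean bug-fixes, covered: {covered_mean}, uncovered: {uncovered_mean}")
```

The same `binary_split` call could be applied to a list of per-element mutation scores with a non-zero `threshold`, which mirrors the "similar line" the abstract mentions for mutation score thresholds.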


