On the use of mutation analysis for evaluating student test suite quality

J. Perretta, A. DeOrio, Arjun Guha, Jonathan Bell
{"title":"On the use of mutation analysis for evaluating student test suite quality","authors":"J. Perretta, A. DeOrio, Arjun Guha, Jonathan Bell","doi":"10.1145/3533767.3534217","DOIUrl":null,"url":null,"abstract":"A common practice in computer science courses is to evaluate student-written test suites against either a set of manually-seeded faults (handwritten by an instructor) or against all other student-written implementations (“all-pairs” grading). However, manually seeding faults is a time consuming and potentially error-prone process, and the all-pairs approach requires significant manual and computational effort to apply fairly and accurately. Mutation analysis, which automatically seeds potential faults in an implementation, is a possible alternative to these test suite evaluation approaches. Although there is evidence in the literature that mutants are a valid substitute for real faults in large open-source software projects, it is unclear whether mutants are representative of the kinds of faults that students make. If mutants are a valid substitute for faults found in student-written code, and if mutant detection is correlated with manually-seeded fault detection and faulty student implementation detection, then instructors can instead evaluate student test suites using mutants generated by open-source mutation analysis tools. Using a dataset of 2,711 student assignment submissions, we empirically evaluate whether mutation score is a good proxy for manually-seeded fault detection rate and faulty student implementation detection rate. Our results show a strong correlation between mutation score and manually-seeded fault detection rate and a moderately strong correlation between mutation score and faulty student implementation detection. We identify a handful of faults in student implementations that, to be coupled to a mutant, would require new or stronger mutation operators or applying mutation operators to an implementation with a different structure than the instructor-written implementation. We also find that this correlation is limited by the fact that faults are not distributed evenly throughout student code, a known drawback of all-pairs grading. Our results suggest that mutants produced by open-source mutation analysis tools are of equal or higher quality than manually-seeded faults and a reasonably good stand-in for real faults in student implementations. Our findings have implications for software testing researchers, educators, and tool builders alike.","PeriodicalId":412271,"journal":{"name":"Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis","volume":"6 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3533767.3534217","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

Abstract

A common practice in computer science courses is to evaluate student-written test suites against either a set of manually-seeded faults (handwritten by an instructor) or against all other student-written implementations (“all-pairs” grading). However, manually seeding faults is a time consuming and potentially error-prone process, and the all-pairs approach requires significant manual and computational effort to apply fairly and accurately. Mutation analysis, which automatically seeds potential faults in an implementation, is a possible alternative to these test suite evaluation approaches. Although there is evidence in the literature that mutants are a valid substitute for real faults in large open-source software projects, it is unclear whether mutants are representative of the kinds of faults that students make. If mutants are a valid substitute for faults found in student-written code, and if mutant detection is correlated with manually-seeded fault detection and faulty student implementation detection, then instructors can instead evaluate student test suites using mutants generated by open-source mutation analysis tools. Using a dataset of 2,711 student assignment submissions, we empirically evaluate whether mutation score is a good proxy for manually-seeded fault detection rate and faulty student implementation detection rate. Our results show a strong correlation between mutation score and manually-seeded fault detection rate and a moderately strong correlation between mutation score and faulty student implementation detection. We identify a handful of faults in student implementations that, to be coupled to a mutant, would require new or stronger mutation operators or applying mutation operators to an implementation with a different structure than the instructor-written implementation. We also find that this correlation is limited by the fact that faults are not distributed evenly throughout student code, a known drawback of all-pairs grading. Our results suggest that mutants produced by open-source mutation analysis tools are of equal or higher quality than manually-seeded faults and a reasonably good stand-in for real faults in student implementations. Our findings have implications for software testing researchers, educators, and tool builders alike.
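For readers less familiar with the technique, the sketch below is a minimal illustration of how mutation analysis evaluates a test suite: mutants are produced by applying small syntactic changes (mutation operators) to an implementation, and the mutation score is the fraction of mutants that the test suite kills. This is a hypothetical toy example, not the paper's tooling, dataset, or grading pipeline; the bulk_price function, the hand-written mutants, and the tiny test suite stand in for what an open-source mutation analysis tool would generate automatically.

```python
# Minimal sketch (assumptions flagged below): how mutation analysis scores a
# student-written test suite. Real studies, including this paper, rely on
# open-source mutation tools that apply mutation operators automatically;
# here two mutants are written by hand so the example stays self-contained.

def bulk_price(price, qty):
    """Reference implementation: 10% discount on orders of 10 or more items."""
    if qty >= 10:
        return price * qty * 0.9
    return price * qty


# Hand-written stand-ins for tool-generated mutants (hypothetical).
def mutant_relational(price, qty):
    if qty > 10:                    # relational operator replacement: >= becomes >
        return price * qty * 0.9
    return price * qty


def mutant_constant(price, qty):
    if qty >= 10:
        return price * qty * 1.0    # constant replacement: the discount is dropped
    return price * qty


def student_suite_passes(impl):
    """A toy 'student test suite': True if every assertion holds for impl."""
    cases = [
        ((2.0, 3), 6.0),    # small order, no discount
        ((2.0, 20), 36.0),  # large order, discount applied
        # note: no test at the qty == 10 boundary
    ]
    return all(abs(impl(*args) - expected) < 1e-9 for args, expected in cases)


if __name__ == "__main__":
    mutants = [mutant_relational, mutant_constant]
    # A mutant is "killed" when at least one test fails against it.
    killed = [m.__name__ for m in mutants if not student_suite_passes(m)]
    score = len(killed) / len(mutants)
    print(f"killed: {killed}")                     # only mutant_constant
    print(f"mutation score: {len(killed)}/{len(mutants)} = {score:.2f}")
```

In the course setting the paper studies, a surviving mutant plays the role that a manually-seeded fault or a faulty peer implementation would otherwise play: here it exposes the missing boundary test at qty == 10 without an instructor having to write that fault by hand.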