A Multi-factor Approach for Flaky Test Detection and Automated Root Cause Analysis

2021 28th Asia-Pacific Software Engineering Conference (APSEC), Pub Date: 2021-12-01, DOI: 10.1109/APSEC53868.2021.00041

Azeem Ahmad, F. D. O. Neto, Zhixiang Shi, K. Sandahl, O. Leifler


Citations: 1

Abstract

Developers often spend time determining whether test case failures are real failures or flaky. Flaky tests, also known as non-deterministic tests, switch their outcomes without any modification to the codebase, reducing developers' confidence during maintenance as well as their confidence in the quality of the product. Re-running test cases to reveal flakiness is resource-consuming and unreliable, and it does not reveal the root causes of test flakiness. Our paper evaluates a multi-factor approach to identifying flaky test executions, implemented in a tool named MDFlaker. The four factors are: trace-back coverage, flaky frequency, number of test smells, and test size. Based on the extracted factors, MDFlaker uses k-Nearest Neighbor (KNN) to determine whether failed test executions are flaky. We investigate MDFlaker in a case study with 2166 test executions from different open-source repositories. We evaluate the effectiveness of our flaky detection tool. We illustrate how the multi-factor approach can be used to reveal root causes of flakiness, and we conduct a qualitative comparison between MDFlaker and other tools proposed in the literature. Our results show that the combination of different factors can be used to identify flaky tests. Each factor has its own trade-off, e.g., trace-back leads to many true positives, while flaky frequency yields more true negatives. Therefore, specific combinations of factors enable classification for testers with limited information (e.g., not enough test history information).
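The classification step described in the abstract can be illustrated with a minimal sketch of k-Nearest Neighbor voting over the four factors. This is not MDFlaker's implementation: the feature order, the min-max scaling, the choice of k = 3, the Euclidean distance, the names `TRAINING`, `min_max_scaler`, and `knn_classify`, and every data value below are assumptions made only for illustration.

```python
from collections import Counter
from math import dist

# Hypothetical feature order (an assumption, not taken from the paper):
# (trace-back coverage, flaky frequency, number of test smells, test size).
# All rows are invented illustration data, not results from the study.
TRAINING = [
    ((0.05, 0.60, 4, 80), "flaky"),
    ((0.10, 0.45, 3, 65), "flaky"),
    ((0.00, 0.70, 5, 120), "flaky"),
    ((0.80, 0.00, 0, 20), "real"),
    ((0.65, 0.05, 1, 30), "real"),
    ((0.90, 0.00, 0, 15), "real"),
]


def min_max_scaler(rows):
    """Return a function that rescales each feature to [0, 1] based on rows."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    # Fall back to a span of 1.0 for constant columns to avoid dividing by zero.
    spans = [(max(col) - min(col)) or 1.0 for col in columns]

    def scale(row):
        return tuple((v - lo) / span for v, lo, span in zip(row, lows, spans))

    return scale


def knn_classify(sample, training, k=3):
    """Label a failed execution by majority vote among its k nearest neighbours."""
    scale = min_max_scaler([features for features, _ in training])
    target = scale(sample)
    # Sort labelled executions by Euclidean distance in the scaled feature space.
    nearest = sorted(training, key=lambda item: dist(scale(item[0]), target))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    print(knn_classify((0.08, 0.50, 3, 70), TRAINING))
    print(knn_classify((0.85, 0.00, 0, 18), TRAINING))
```

Scaling matters in a sketch like this because test size is measured on a much larger numeric range than the other factors and would otherwise dominate the distance. Dropping a column from every row would mimic classifying with fewer factors, for example when test history is not available to compute flaky frequency.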


