Pub Date : 2023-05-01 DOI: 10.1109/ast58925.2023.00026
AST 2023 Steering Committee
{"title":"AST 2023 Steering Committee","authors":"","doi":"10.1109/ast58925.2023.00026","DOIUrl":"https://doi.org/10.1109/ast58925.2023.00026","url":null,"abstract":"","PeriodicalId":252417,"journal":{"name":"2023 IEEE/ACM International Conference on Automation of Software Test (AST)","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121692627","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-05-01 DOI: 10.1109/AST58925.2023.00021
Evaluating the Trade-offs of Text-based Diversity in Test Prioritisation
Ranim Khojah, Chi Hong Chao, F. D. O. Neto
Diversity-based techniques (DBT) have proven cost-effective by prioritising the most dissimilar test cases so that faults are detected at earlier stages of test execution. Diversity is measured on test specifications to convey how different test cases are from one another. However, there is little research on the trade-offs between diversity measures based on different types of text-based specification (lexical or semantic), particularly because the text content of test scripts varies widely from the unit level (e.g., code) to the system level (e.g., natural language). This paper compares and evaluates the cost-effectiveness, in terms of coverage and failure detection, of different text-based diversity measures at different levels of testing. We perform an experiment on the test suites of 7 open-source projects at the unit level and 2 industry projects at the integration and system levels. Our results show that test suites prioritised using semantic-based diversity measures yield a small improvement in requirements coverage, whereas lexical diversity showed less coverage than random for system-level artefacts. In contrast, using lexical measures such as Jaccard or Levenshtein to prioritise code artefacts yields better failure coverage across all levels of testing. We summarise our findings in a list of recommendations for using semantic or lexical diversity at different levels of testing.
{"title":"Evaluating the Trade-offs of Text-based Diversity in Test Prioritisation","authors":"Ranim Khojah, Chi Hong Chao, F. D. O. Neto","doi":"10.1109/AST58925.2023.00021","DOIUrl":"https://doi.org/10.1109/AST58925.2023.00021","url":null,"abstract":"Diversity-based techniques (DBT) have been cost-effective by prioritizing the most dissimilar test cases to detect faults at earlier stages of test execution. Diversity is measured on test specifications to convey how different test cases are from one another. However, there is little research on the trade-off of diversity measures based on different types of text-based specification (lexicographical or semantics). Particularly because the text content in test scripts vary widely from unit (e.g., code) to system-level (e.g., natural language). This paper compares and evaluates the cost-effectiveness in coverage and failures of different text-based diversity measures for different levels of tests. We perform an experiment on the test suites of 7 open source projects on the unit level, and 2 industry projects on the integration and system level. Our results show that test suites prioritised using semantic-based diversity measures causes a small improvement in requirements coverage, as opposed to lexical diversity that showed less coverage than random for system-level artefacts. In contrast, using lexical-based measures such as Jaccard or Levenshtein to prioritise code artefacts yield better failure coverage across all levels of tests. We summarise our findings into a list of recommendations for using semantic or lexical diversity on different levels of testing.","PeriodicalId":252417,"journal":{"name":"2023 IEEE/ACM International Conference on Automation of Software Test (AST)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114303252","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-05-01 DOI: 10.1109/AST58925.2023.00020
Orchestration Strategies for Regression Test Suites
Renan Greca, Breno Miranda, A. Bertolino
Regression testing is widely studied in the literature, although most research on the topic is concerned with improving specific sub-challenges of a wider goal. Test suite orchestration proposes a more comprehensive view of the challenge of regression testing by merging and combining different techniques with a variety of objectives, including prioritizing, selecting, reducing and amplifying tests, detecting flaky tests, and potentially more. This paper presents the key approaches and techniques that form test suite orchestration, along with common evaluation metrics, and discusses how they can be used together to ultimately provide an efficient and effective regression testing strategy. To illustrate the benefits of orchestration, we provide examples of existing papers that take steps towards this goal, even if the specific terminology is not yet used. Orchestrated strategies that utilize existing regression testing techniques provide a pathway towards practical, real-world adoption of the academic literature.
{"title":"Orchestration Strategies for Regression Test Suites","authors":"Renan Greca, Breno Miranda, A. Bertolino","doi":"10.1109/AST58925.2023.00020","DOIUrl":"https://doi.org/10.1109/AST58925.2023.00020","url":null,"abstract":"Regression testing is widely studied in the literature, although most research on the topic is concerned with improving specific sub-challenges of a wider goal. Test suite orchestration proposes a more comprehensive view of the challenge of regression testing, by merging and combining different techniques with a variety of objectives, including prioritizing, selecting, reducing and amplifying tests, detecting flaky tests and potentially more. This paper presents the key approaches and techniques that form test suite orchestration, along with common evaluation metrics, and discusses how they can be used together to ultimately provide an efficient and effective regression testing strategy. To illustrate the benefits of orchestration, we provide some examples of existing papers that take steps towards this goal, even if the specific terminology is not yet used. Orchestrated strategies utilizing existing regression testing techniques provide a pathway to practicality and real-world usage of the academic literature.","PeriodicalId":252417,"journal":{"name":"2023 IEEE/ACM International Conference on Automation of Software Test (AST)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130521819","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-05-01 DOI: 10.1109/AST58925.2023.00018
FlakyCat: Predicting Flaky Tests Categories using Few-Shot Learning
Amal Akli, Guillaume Haben, Sarra Habchi, Mike Papadakis, Yves Le Traon
Flaky tests are tests that yield different outcomes when run on the same version of a program. This non-deterministic behaviour plagues continuous integration with false signals, wasting developers' time and reducing their trust in test suites. Studies have highlighted the importance of keeping tests free of flakiness. Recently, the research community has been pushing towards the detection of flaky tests, suggesting many static and dynamic approaches. While promising, those approaches mainly focus on classifying tests as flaky or not, and even when high performance is reported, it remains challenging to understand the cause of flakiness. This understanding is crucial for researchers and developers who aim to fix it. To help with the comprehension of a given flaky test, we propose FlakyCat, the first approach to classify flaky tests based on their root cause category. FlakyCat relies on CodeBERT for code representation and leverages Siamese networks to train a multi-class classifier. We train and evaluate FlakyCat on a set of 451 flaky tests collected from open-source Java projects. Our evaluation shows that FlakyCat categorises flaky tests accurately, with an F1 score of 73%. Furthermore, we investigate the performance of our approach for each category, revealing that Async waits, Unordered collections and Time-related flaky tests are accurately classified, while Concurrency-related flaky tests are more challenging to predict. Finally, to facilitate the comprehension of FlakyCat's predictions, we present a new technique for CodeBERT-based model interpretability that highlights the code statements influencing the categorisation.
{"title":"FlakyCat: Predicting Flaky Tests Categories using Few-Shot Learning","authors":"Amal Akli, Guillaume Haben, Sarra Habchi, Mike Papadakis, Yves Le Traon","doi":"10.1109/AST58925.2023.00018","DOIUrl":"https://doi.org/10.1109/AST58925.2023.00018","url":null,"abstract":"Flaky tests are tests that yield different outcomes when run on the same version of a program. This non-deterministic behaviour plagues continuous integration with false signals, wasting developers’ time and reducing their trust in test suites. Studies highlighted the importance of keeping tests flakiness-free. Recently, the research community has been pushing towards the detection of flaky tests by suggesting many static and dynamic approaches. While promising, those approaches mainly focus on classifying tests as flaky or not and, even when high performances are reported, it remains challenging to understand the cause of flakiness. This part is crucial for researchers and developers that aim to fix it. To help with the comprehension of a given flaky test, we propose FlakyCat, the first approach to classify flaky tests based on their root cause category. FlakyCat relies on CodeBERT for code representation and leverages Siamese networks to train a multi-class classifier. We train and evaluate FlakyCat on a set of 451 flaky tests collected from open-source Java projects. Our evaluation shows that FlakyCat categorises flaky tests accurately, with an F1 score of 73%. Furthermore, we investigate the performance of our approach for each category, revealing that Async waits, Unordered collections and Time-related flaky tests are accurately classified, while Concurrency-related flaky tests are more challenging to predict. Finally, to facilitate the comprehension of FlakyCat’s predictions, we present a new technique for CodeBERT-based model interpretability that highlights code statements influencing the categorization.","PeriodicalId":252417,"journal":{"name":"2023 IEEE/ACM International Conference on Automation of Software Test (AST)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114288161","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-05-01 DOI: 10.1109/AST58925.2023.00012
Structural Test Input Generation for 3-Address Code Coverage Using Path-Merged Symbolic Execution
Soha Hussein, Stephen McCamant, Elena Sherman, Vaibhav Sharma, M. Whalen
Test input generation is one of the key applications of symbolic execution (SE). However, being a path-sensitive technique, SE often faces path explosion even when creating a branch-adequate test suite. Path-merging symbolic execution (PM-SE) alleviates the path explosion problem by summarizing regions of code into disjunctive constraints, thus traversing at once a set of paths that share the same prefix. Previous work has shown that PM-SE can reduce run-time by up to 38%, though these improvements can be impaired if the summarized code results in complex constraints or introduces additional symbols that increase the number of branching points later in the execution. Considering these trade-offs, examining the ability of PM-SE to generate branch-adequate test inputs is an open research problem. This paper investigates it by developing a technique that extracts structural coverage-related queries from the disjunctive constraints. Using this approach, we extend PM-SE to generate branch-adequate test inputs. Experiments compare the effectiveness and efficiency of test input generation by the SE and PM-SE techniques. The results show that the techniques are complementary: for some programs, PM-SE yields faster coverage with fewer generated tests, while for others, SE performs better. In addition, each technique covers branches that the other fails to discover.
{"title":"Structural Test Input Generation for 3-Address Code Coverage Using Path-Merged Symbolic Execution","authors":"Soha Hussein, Stephen McCamant, Elena Sherman, Vaibhav Sharma, M. Whalen","doi":"10.1109/AST58925.2023.00012","DOIUrl":"https://doi.org/10.1109/AST58925.2023.00012","url":null,"abstract":"Test input generation is one of the key applications of symbolic execution (SE). However, being a path-sensitive technique, SE often faces path explosion even when creating a branch-adequate test suite. Path-merging symbolic execution (PM-SE) alleviates the path explosion problem by summarizing regions of code into disjunctive constraints, thus traversing at once a set of paths with the same prefixes. Previous work has shown that PM-SE can reduce run-time up to 38%, though these improvements can be impaired if the summarized code results in complex constraints or introduces additional symbols that increase the number of branching points in the later execution.Considering these trade-offs, examining the ability of PM-SE to generate branch-adequate test inputs is an open research problem. This paper investigates it by developing a technique that extracts structural coverage-related queries from disjoint constraints. Using this approach, we extend PM-SE to generate branch-adequate test inputs.Experiments compare the effectiveness and efficiency of test input generation by SE and PM-SE techniques. Results show that those techniques are complementary. For some programs, PM-SE yields faster coverage, with fewer generated tests, while for others, SE performs better. In addition, each technique covers branches that the other fails to discover.","PeriodicalId":252417,"journal":{"name":"2023 IEEE/ACM International Conference on Automation of Software Test (AST)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121982353","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-05-01 DOI: 10.1109/AST58925.2023.00015
Towards a Review on Simulated ADAS/AD Testing
Yavuz Köroglu, F. Wotawa
Vehicle and traffic simulation is a common practice for testing and evaluating advanced driver-assistance systems (ADAS) and autonomous driving (AD). As a result, the literature mentions numerous simulators capable of simulating ADAS/AD implementations. In this study, we investigate previous surveys that cover multiple scenarios and initiate a systematic review targeting simulators for testing ADAS/AD. Our results show that the literature mentions, in total, 181 simulators capable of evaluating one or more ADAS/AD implementations. Furthermore, according to previous surveys and reviews, the most popular simulators are CARLA, Airsim, and SUMO. Finally, our results uncover that every five years, the number of novel simulators added to the literature grows at least quadratically, showing that further review is necessary to address the differences between these simulators and understand the simulator landscape from an ADAS/AD testing perspective.
{"title":"Towards a Review on Simulated ADAS/AD Testing","authors":"Yavuz Köroglu, F. Wotawa","doi":"10.1109/AST58925.2023.00015","DOIUrl":"https://doi.org/10.1109/AST58925.2023.00015","url":null,"abstract":"Vehicle and traffic simulation is a common practice for testing and evaluating advanced driver-assistance systems (ADAS) and autonomous driving (AD). As a result, the literature mentions numerous simulators capable of simulating ADAS/AD implementations. In this study, we investigate previous surveys that cover multiple scenarios and initiate a systematic review targeting simulators for testing ADAS/AD. Our results show that the literature mentions, in total, 181 simulators capable of evaluating one or more ADAS/AD implementations. Furthermore, according to previous surveys and reviews, the most popular simulators are CARLA, Airsim, and SUMO. Finally, our results uncover that every five years, the number of novel simulators added to the literature grows at least quadratically, showing that further review is necessary to address the differences between these simulators and understand the simulator landscape from an ADAS/AD testing perspective.","PeriodicalId":252417,"journal":{"name":"2023 IEEE/ACM International Conference on Automation of Software Test (AST)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122486454","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-05-01 DOI: 10.1109/AST58925.2023.00007
Cross-Project setting using Deep learning Architectures in Just-In-Time Software Fault Prediction: An Investigation
Sushant Kumar Pandey, A. Tripathi
Just-In-Time Software Fault Prediction (JIT-SFP) is concerned with predicting, using various learning methods, whether a software change is fault-inducing or not. Building such a prediction model requires adequate training data; however, little training data is available at the beginning of a software project. A Cross-Project (CP) setting can overcome this challenge by employing data from other software projects, and it can achieve predictive performance similar to Within-Project (WP) fault prediction. It is still unclear, however, to what extent CP training data is useful in such a situation. Furthermore, it remains to be discovered whether CP data are helpful in the initial phase of fault detection and whether, when the WP training set is inadequate, it is beneficial to extend it with CP data. This article investigates these questions on real software projects. We propose a new method, called JITCP-Predictor, that leverages a deep belief network and long short-term memory. The proposed model significantly outperforms the benchmark methods on all ten projects, and it is superior by 10.63% to 136.36% and 7.04% to 35.71% in terms of MCC and F-Measure, respectively. The mean MCC and F-Measure values produced by JITCP-Predictor are 0.52 ± 0.021 and 0.76 ± 0.76, respectively. We also find that the proposed model is more suitable for large and moderate-size projects. The proposed model avoids class imbalance and overfitting problems and incurs reasonable training cost.
{"title":"Cross-Project setting using Deep learning Architectures in Just-In-Time Software Fault Prediction: An Investigation","authors":"Sushant Kumar Pandey, A. Tripathi","doi":"10.1109/AST58925.2023.00007","DOIUrl":"https://doi.org/10.1109/AST58925.2023.00007","url":null,"abstract":"The prediction of whether a software change is fault-inducing or not in the software system using various learning methods, the study concerned in Just-In-Time Software Fault Prediction (JIT-SFP). Building such predicting model requires adequate training data. However, there needs to be more training data at the beginning of the software system. Cross-Project (CP) setting can subjugate this challenge by employing data from different software projects. It can achieve similar predictive performance to Within-Project (WP) fault prediction. It is still being determined to what level the CP training data can be useful in such a situation. Furthermore, it also needs to be discovered whether CP data are helpful in the initial phase of fault detection, and when there is an inadequate WP train set, CP could be beneficial to extend. This article deals with such investigations in real software projects. We proposed a new method by levering a deep belief network and long short-term memory called JITCP-Predictor. Out of ten, the proposed model significantly outperforms every ten project benchmark methods, and it is superior from 10.63% to 136.36% and 7.04% to 35.71% in terms of MCC and F-Measure, respectively. The mean values of MCC and F-Measure produced by JITCP-Predictor are 0.52 ± 0.021 and 0.76 ± 0.76, respectively. We also found that the proposed model is more suitable for large and moderate-size projects. The proposed model avoids class imbalance and overfitting problems and takes reasonable training costs.","PeriodicalId":252417,"journal":{"name":"2023 IEEE/ACM International Conference on Automation of Software Test (AST)","volume":"1246 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114359277","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-05-01 DOI: 10.1109/AST58925.2023.00009
AutoMetric: Towards Measuring Open-Source Software Quality Metrics Automatically
Taejun Lee, Heewon Park, Heejo Lee
In modern software development, open-source software (OSS) plays a crucial role. Although some methods exist to verify the safety of OSS, current automation technologies fall short. To address this problem, we propose AutoMetric, an automatic technique for measuring security metrics for OSS at the repository level. Because AutoMetric only collects the repository addresses of the projects, it can inspect many projects simultaneously regardless of their size and scope. AutoMetric comprises five metrics: Mean Time to Update (MU), Mean Time to Commit (MC), Number of Contributors (NC), Inactive Period (IP), and Branch Protection (BP). These metrics can be recalculated quickly even when the source code changes. By comparing the AutoMetric metrics against 2,675 vulnerabilities reported in the GitHub Advisory Database (GAD), we find that the more frequent the updates and commits and the shorter the inactive period, the more vulnerabilities were found.
{"title":"AutoMetric: Towards Measuring Open-Source Software Quality Metrics Automatically","authors":"Taejun Lee, Heewon Park, Heejo Lee","doi":"10.1109/AST58925.2023.00009","DOIUrl":"https://doi.org/10.1109/AST58925.2023.00009","url":null,"abstract":"In modern software development, open-source software (OSS) plays a crucial role. Although some methods exist to verify the safety of OSS, the current automation technologies fall short. To address this problem, we propose AutoMetric, an automatic technique for measuring security metrics for OSS in repository level. Using AutoMetric which only collects repository addresses of the projects, it is possible to inspect many projects simultaneously regardless of its size and scope. AutoMetric contains five metrics: Mean Time to Update (MU), Mean Time to Commit (MC), Number of Contributors (NC), Inactive Period (IP), and Branch Protection (BP). These metrics can be calculated quickly even if the source code changes. By comparing metrics in AutoMetric with 2,675 reported vulnerabilities in GitHub Advisory Database (GAD), the result shows that the more frequent updates and commits and the shorter the inactivity period, the more vulnerabilities were found.","PeriodicalId":252417,"journal":{"name":"2023 IEEE/ACM International Conference on Automation of Software Test (AST)","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115128009","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-04-28 DOI: 10.1109/AST58925.2023.00014
Cross-coverage testing of functionally equivalent programs
A. Bertolino, G. D. Angelis, F. Giandomenico, F. Lonetti
Cross-coverage of a program P refers to the test coverage measured over a different program Q that is functionally equivalent to P. The novel concept of cross-coverage can find useful applications in the testing of redundant software. Here we apply cross-coverage to test suite augmentation and show that additional test cases generated from the coverage of an equivalent program, referred to as cross tests, can increase the coverage of a program more effectively than a random baseline. We also observe that, contrary to traditional coverage testing, cross-coverage could help find (artificially created) missing-functionality faults.
{"title":"Cross-coverage testing of functionally equivalent programs","authors":"A. Bertolino, G. D. Angelis, F. Giandomenico, F. Lonetti","doi":"10.1109/AST58925.2023.00014","DOIUrl":"https://doi.org/10.1109/AST58925.2023.00014","url":null,"abstract":"Cross-coverage of a program P refers to the test coverage measured over a different program Q that is functionally equivalent to P. The novel concept of cross-coverage can find useful applications in the test of redundant software. We apply here cross-coverage for test suite augmentation and show that additional test cases generated from the coverage of an equivalent program, referred to as cross tests, can increase the coverage of a program in more effective way than a random baseline. We also observe that -contrary to traditional coverage testing-cross coverage could help finding (artificially created) missing functionality faults.","PeriodicalId":252417,"journal":{"name":"2023 IEEE/ACM International Conference on Automation of Software Test (AST)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123702187","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-03-17 DOI: 10.1109/AST58925.2023.00016
On the Effect of Instrumentation on Test Flakiness
Shawn Rasheed, Jens Dietrich, Amjed Tahir
Test flakiness is a problem that affects testing and the processes that rely on it. Several factors cause or influence the flakiness of test outcomes; test execution order, randomness and concurrency are some of the more common and well-studied causes. Some studies mention code instrumentation as a factor that causes or affects test flakiness, but evidence for this issue is scarce. In this study, we attempt to systematically collect evidence on the effects of instrumentation on test flakiness. We experiment with common types of instrumentation for Java programs, namely application performance monitoring, coverage, and profiling instrumentation. We then study the effects of instrumentation on a set of nine programs obtained from an existing dataset used to study test flakiness, consisting of popular GitHub projects written in Java. We observe cases where real-world instrumentation causes flakiness in a program; however, this effect is rare. We also discuss a related issue: how instrumentation may interfere with flakiness detection and prevention.
{"title":"On the Effect of Instrumentation on Test Flakiness","authors":"Shawn Rasheed, Jens Dietrich, Amjed Tahir","doi":"10.1109/AST58925.2023.00016","DOIUrl":"https://doi.org/10.1109/AST58925.2023.00016","url":null,"abstract":"Test flakiness is a problem that affects testing and processes that rely on it. Several factors cause or influence the flakiness of test outcomes. Test execution order, randomness and concurrency are some of the more common and well-studied causes. Some studies mention code instrumentation as a factor that causes or affects test flakiness. However, evidence for this issue is scarce. In this study, we attempt to systematically collect evidence for the effects of instrumentation on test flakiness. We experiment with common types of instrumentation for Java programs—namely, application performance monitoring, coverage and profiling instrumentation. We then study the effects of instrumentation on a set of nine programs obtained from an existing dataset used to study test flakiness, consisting of popular GitHub projects written in Java. We observe cases where realworld instrumentation causes flakiness in a program. However, this effect is rare. We also discuss a related issue—how instrumentation may interfere with flakiness detection and prevention.","PeriodicalId":252417,"journal":{"name":"2023 IEEE/ACM International Conference on Automation of Software Test (AST)","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-03-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115022013","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}