Pub Date: 2022-08-28 | DOI: 10.1109/ICSME55016.2022.00014
Title: An Empirical Study on the Usage of Automated Machine Learning Tools
Forough Majidi, Moses Openja, Foutse Khomh, Heng Li
The popularity of automated machine learning (AutoML) tools in different domains has increased over the past few years. Machine learning (ML) practitioners use AutoML tools to automate and optimize steps such as feature engineering, model training, and hyperparameter optimization. Recent work performed qualitative studies on practitioners’ experiences of using AutoML tools and compared different AutoML tools based on their performance and the features they provide, but none of the existing work studied the practices of using AutoML tools in real-world projects at a large scale. Therefore, we conducted an empirical study to understand how ML practitioners use AutoML tools in their projects. To this end, we examined the top 10 most used AutoML tools and their respective usages in a large number of open-source project repositories hosted on GitHub. The results of our study show 1) which AutoML tools are most commonly used by ML practitioners and 2) the characteristics of the repositories that use these AutoML tools. We also identified the purposes of using AutoML tools (e.g., model parameter sampling, search space management, model evaluation/error analysis, data/feature transformation, and data labeling) and the stages of the ML pipeline (e.g., feature engineering) where AutoML tools are used. Finally, we report how often AutoML tools are used together in the same source code files. We hope our results can help ML practitioners learn about different AutoML tools and their usages, so that they can pick the right tool for their purposes. AutoML tool developers can also benefit from our findings to gain insight into how their tools are used and improve them to better fit users’ needs.
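To make the mined usage categories concrete, the following is a minimal, hypothetical sketch of hyperparameter optimization with an AutoML-style tool; Optuna and the chosen search space are illustrative assumptions on our part, since the abstract does not list the specific tools in the top 10.

```python
# Hypothetical AutoML usage sketch (tool choice and search space are our assumptions,
# not taken from the study): parameter sampling, search-space management, and evaluation.
import optuna
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def objective(trial):
    # Search-space management: the trial samples model parameters from declared ranges.
    n_estimators = trial.suggest_int("n_estimators", 10, 200)
    max_depth = trial.suggest_int("max_depth", 2, 16)
    model = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth)
    X, y = load_iris(return_X_y=True)
    # Model evaluation: mean cross-validation accuracy is the optimization target.
    return cross_val_score(model, X, y, cv=3).mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params)
```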
{"title":"An Empirical Study on the Usage of Automated Machine Learning Tools","authors":"Forough Majidi, Moses Openja, Foutse Khomh, Heng Li","doi":"10.1109/ICSME55016.2022.00014","DOIUrl":"https://doi.org/10.1109/ICSME55016.2022.00014","url":null,"abstract":"The popularity of automated machine learning (AutoML) tools in different domains has increased over the past few years. Machine learning (ML) practitioners use AutoML tools to automate and optimize the process of feature engineering, model training, and hyperparameter optimization and so on. Recent work performed qualitative studies on practitioners’ experiences of using AutoML tools and compared different AutoML tools based on their performance and provided features, but none of the existing work studied the practices of using AutoML tools in real-world projects at a large scale. Therefore, we conducted an empirical study to understand how ML practitioners use AutoML tools in their projects. To this end, we examined the top 10 most used AutoML tools and their respective usages in a large number of open-source project repositories hosted on GitHub. The results of our study show 1) which AutoML tools are mostly used by ML practitioners and 2) the characteristics of the repositories that use these AutoML tools. Also, we identified the purpose of using AutoML tools (e.g. model parameter sampling, search space management, model evaluation/error-analysis, Data/ feature transformation, and data labeling) and the stages of the ML pipeline (e.g. feature engineering) where AutoML tools are used. Finally, we report how often AutoML tools are used together in the same source code files. We hope our results can help ML practitioners learn about different AutoML tools and their usages, so that they can pick the right tool for their purposes. Besides, AutoML tool developers can benefit from our findings to gain insight into the usages of their tools and improve their tools to better fit the users’ usages and needs.","PeriodicalId":300084,"journal":{"name":"2022 IEEE International Conference on Software Maintenance and Evolution (ICSME)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134365813","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-08-16 | DOI: 10.1109/ICSME55016.2022.00046
Title: Don’t Reinvent the Wheel: Towards Automatic Replacement of Custom Implementations with APIs
Rosalia Tufano, Emad Aghajani, G. Bavota
Reusing code is a common practice in software development: it helps developers speed up the implementation task while also reducing the chances of introducing bugs, under the assumption that the reused code has already been tested, possibly in production. Despite these benefits, opportunities for reuse are not always in plain sight and, thus, developers may miss them. We present our preliminary steps in building RETIWA, a recommender able to automatically identify custom implementations in a given project that are good candidates to be replaced by open source APIs. RETIWA relies on a "knowledge base" consisting of real examples of custom implementation-to-API replacements. In this work, we present the mining strategy we tailored to automatically and reliably extract replacements of custom implementations with APIs from open source projects. This is the first step towards building the envisioned recommender.
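As a concrete picture of what a custom implementation-to-API replacement looks like, here is a small hypothetical Python pair; the example is ours and is not taken from RETIWA's knowledge base.

```python
# Hypothetical before/after pair illustrating a custom implementation that could be
# replaced by a library API; the concrete example is ours, not one mined by the paper.
from collections import Counter

# Custom implementation: count word frequencies by hand.
def word_frequencies_custom(words):
    freqs = {}
    for word in words:
        freqs[word] = freqs.get(word, 0) + 1
    return freqs

# API-based replacement candidate: the standard library already provides this behaviour.
def word_frequencies_api(words):
    return dict(Counter(words))

assert word_frequencies_custom(["a", "b", "a"]) == word_frequencies_api(["a", "b", "a"])
```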
{"title":"Don’t Reinvent the Wheel: Towards Automatic Replacement of Custom Implementations with APIs","authors":"Rosalia Tufano, Emad Aghajani, G. Bavota","doi":"10.1109/ICSME55016.2022.00046","DOIUrl":"https://doi.org/10.1109/ICSME55016.2022.00046","url":null,"abstract":"Reusing code is a common practice in software development: It helps developers speedup the implementation task while also reducing the chances of introducing bugs, given the assumption that the reused code has been tested, possibly in production. Despite these benefits, opportunities for reuse are not always in plain sight and, thus, developers may miss them. We present our preliminary steps in building RETIWA, a recommender able to automatically identify custom implementations in a given project that are good candidates to be replaced by open source APIs. RETIWA relies on a \"knowledge base\" consisting of real examples of custom implementation-to-API replacements. In this work, we present the mining strategy we tailored to automatically and reliably extract replacements of custom implementations with APIs from open source projects. This is the first step towards building the envisioned recommender.","PeriodicalId":300084,"journal":{"name":"2022 IEEE International Conference on Software Maintenance and Evolution (ICSME)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123293568","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-08-03 | DOI: 10.1109/ICSME55016.2022.00050
Title: How to Configure Masked Event Anomaly Detection on Software Logs?
Jesse Nyyssölä, M. Mäntylä, M. Varela
Software log anomaly event detection with masked event prediction has various technical approaches with countless configurations and parameters. Our objective is to provide a baseline of settings for similar studies in the future. The models we use are the N-Gram model, a classic approach in the field of natural language processing (NLP), and two deep learning (DL) models: long short-term memory (LSTM) and convolutional neural network (CNN). We used four datasets: Profilence, BlueGene/L (BGL), Hadoop Distributed File System (HDFS), and Hadoop. The other settings studied are the size of the sliding window, which determines how many surrounding events are used to predict a given event; the mask position (the position within the window being predicted); whether only unique sequences are used; and the portion of data used for training. The results show clear indications of settings that can be generalized across datasets. The performance of the DL models does not deteriorate as the window size increases, while the N-Gram model shows worse performance with large window sizes on the BGL and Profilence datasets. Despite the popularity of Next Event Prediction, the results show that in this context it is better not to predict events at the edges of the subsequence, i.e., the first or last event; the best result comes from predicting the fourth event when the window size is five. Regarding the amount of data used for training, the results show differences across datasets and models. For example, the N-Gram model appears to be more sensitive to the lack of data than the DL models. Overall, for similar experimental setups we suggest the following general baseline: window size 10, mask position second to last, do not filter out non-unique sequences, and use half of the total data for training.
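To illustrate how the window size and mask position settings interact, the sketch below builds masked (context, target) pairs from a log event sequence; the event names and the helper function are our own illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch (not the paper's code): turn a log event sequence into
# masked-event prediction samples for a given window size and mask position.
def masked_samples(events, window_size=5, mask_position=3):
    """Return (context, target) pairs; mask_position is 0-based inside the window."""
    samples = []
    for start in range(len(events) - window_size + 1):
        window = events[start:start + window_size]
        target = window[mask_position]  # the event to predict
        context = window[:mask_position] + window[mask_position + 1:]  # surrounding events
        samples.append((context, target))
    return samples

log = ["login", "read", "read", "write", "error", "logout"]
# Window size five with the fourth event masked, the best-performing setting for size 5.
for context, target in masked_samples(log, window_size=5, mask_position=3):
    print(context, "->", target)
```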
{"title":"How to Configure Masked Event Anomaly Detection on Software Logs?","authors":"Jesse Nyyssölä, M. Mäntylä, M. Varela","doi":"10.1109/ICSME55016.2022.00050","DOIUrl":"https://doi.org/10.1109/ICSME55016.2022.00050","url":null,"abstract":"Software Log anomaly event detection with masked event prediction has various technical approaches with countless configurations and parameters. Our objective is to provide a baseline of settings for similar studies in the future. The models we use are the N-Gram model, which is a classic approach in the field of natural language processing (NLP), and two deep learning (DL) models long short-term memory (LSTM) and convolutional neural network (CNN). For datasets we used four datasets Profilence, BlueGene/L (BGL), Hadoop Distributed File System (HDFS) and Hadoop. Other settings are the size of the sliding window which determines how many surrounding events we are using to predict a given event, mask position (the position within the window we are predicting), the usage of only unique sequences, and the portion of data that is used for training. The results show clear indications of settings that can be generalized across datasets. The performance of the DL models does not deteriorate as the window size increases while the N-Gram model shows worse performance with large window sizes on the BGL and Profilence datasets. Despite the popularity of Next Event Prediction, the results show that in this context it is better not to predict events at the edges of the subsequence, i.e., first or last event, with the best result coming from predicting the fourth event when the window size is five. Regarding the amount of data used for training, the results show differences across datasets and models. For example, the N-Gram model appears to be more sensitive toward the lack of data than the DL models. Overall, for similar experimental setups we suggest the following general baseline: Window size 10, mask position second to last, do not filter out non-unique sequences, and use a half of the total data for training.","PeriodicalId":300084,"journal":{"name":"2022 IEEE International Conference on Software Maintenance and Evolution (ICSME)","volume":"227 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121040449","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-08-02 | DOI: 10.1109/ICSME55016.2022.00054
Title: Together or Apart? Investigating a mediator bot to aggregate bot’s comments on pull requests
Eric Ribeiro, Ronan Nascimento, Igor Steinmacher, Laerte Xavier, M. Gerosa, Hugo de Paula, M. Wessel
Software bots connect users and tools, streamlining the pull request review process in social coding platforms. However, bots can introduce information overload into developers’ communication. Information overload is especially problematic for newcomers, who are still exploring the project and may feel overwhelmed by the number of messages. Inspired by the literature from other domains, we designed and evaluated FunnelBot, a bot that acts as a mediator between developers and other bots in the repository. We conducted a within-subject study with 25 newcomers to capture their perceptions and preferences. Our results provide insights for bot developers who want to mitigate noise and create bots for supporting newcomers, laying a foundation for designing better bots.
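The mediation idea can be pictured with a short sketch that collects bot-authored comments on a pull request and groups them by bot before a single summary is posted; this is a hypothetical illustration using PyGithub with placeholder repository, token, and PR values, not FunnelBot's actual implementation.

```python
# Hypothetical sketch of aggregating bot comments on a pull request (not FunnelBot's code).
# "owner/project", the PR number, and the token are placeholders.
from collections import defaultdict
from github import Github  # PyGithub

gh = Github("YOUR_TOKEN")
pr = gh.get_repo("owner/project").get_pull(42)

by_bot = defaultdict(list)
for comment in pr.get_issue_comments():
    if comment.user.type == "Bot":  # keep only bot-authored comments
        by_bot[comment.user.login].append(comment.body)

# A mediator bot could post one aggregated comment instead of many individual ones.
summary = "\n\n".join(f"{bot}:\n" + "\n".join(bodies) for bot, bodies in by_bot.items())
print(summary)
```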
{"title":"Together or Apart? Investigating a mediator bot to aggregate bot’s comments on pull requests","authors":"Eric Ribeiro, Ronan Nascimento, Igor Steinmacher, Laerte Xavier, M. Gerosa, Hugo de Paula, M. Wessel","doi":"10.1109/ICSME55016.2022.00054","DOIUrl":"https://doi.org/10.1109/ICSME55016.2022.00054","url":null,"abstract":"Software bots connect users and tools, streamlining the pull request review process in social coding platforms. However, bots can introduce information overload into developers’ communication. Information overload is especially problematic for newcomers, who are still exploring the project and may feel overwhelmed by the number of messages. Inspired by the literature of other domains, we designed and evaluated FunnelBot, a bot that acts as a mediator between developers and other bots in the repository. We conducted a within-subject study with 25 newcomers to capture their perceptions and preferences. Our results provide insights for bot developers who want to mitigate noise and create bots for supporting newcomers, laying a foundation for designing better bots.","PeriodicalId":300084,"journal":{"name":"2022 IEEE International Conference on Software Maintenance and Evolution (ICSME)","volume":"110 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123024913","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-08-02 | DOI: 10.1109/ICSME55016.2022.00043
Title: An Exploratory Study of Documentation Strategies for Product Features in Popular GitHub Projects
Tim Puhlfurss, Lloyd Montgomery, W. Maalej
[Background] In large open-source software projects, development knowledge is often fragmented across multiple artefacts and contributors, such that individual stakeholders are generally unaware of the full breadth of the product features. However, users want to know what the software is capable of, while contributors need to know where to fix, update, and add features. [Objective] This work aims at understanding how feature knowledge is documented in GitHub projects and how it is linked (if at all) to the source code. [Method] We conducted an in-depth qualitative exploratory content analysis of 25 popular GitHub repositories that provided the documentation artefacts recommended by GitHub’s Community Standards indicator. We first extracted strategies used to document software features in textual artefacts and then strategies used to link the feature documentation with source code. [Results] We observed feature documentation in all studied projects in artefacts such as READMEs, wikis, and website resource files. However, the features were often described in an unstructured way. Additionally, tracing techniques to connect feature documentation and source code were rarely used. [Conclusions] Our results suggest that feature documentation in open-source projects is often lacking (or low-prioritised), that normalised structures are rarely used, and that explicit references to source code are rare. As a result, product feature traceability is likely to be very limited, and maintainability is likely to suffer over time.
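For readers unfamiliar with the Community Standards artefacts mentioned in the method, the sketch below checks which of them a locally cloned repository provides; the artefact list is our approximation of GitHub's checklist, and the repository path is a placeholder.

```python
# Illustrative sketch: report which community-standards documentation artefacts a cloned
# repository contains. The artefact list approximates GitHub's checklist (an assumption).
from pathlib import Path

ARTEFACTS = {
    "README": ["README.md", "README.rst", "README"],
    "Contributing guide": ["CONTRIBUTING.md", ".github/CONTRIBUTING.md"],
    "Code of conduct": ["CODE_OF_CONDUCT.md", ".github/CODE_OF_CONDUCT.md"],
    "License": ["LICENSE", "LICENSE.md"],
}

def documentation_report(repo_path):
    repo = Path(repo_path)
    return {name: any((repo / candidate).exists() for candidate in candidates)
            for name, candidates in ARTEFACTS.items()}

print(documentation_report("/path/to/cloned/repo"))  # placeholder path
```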
{"title":"An Exploratory Study of Documentation Strategies for Product Features in Popular GitHub Projects","authors":"Tim Puhlfurss, Lloyd Montgomery, W. Maalej","doi":"10.1109/ICSME55016.2022.00043","DOIUrl":"https://doi.org/10.1109/ICSME55016.2022.00043","url":null,"abstract":"[Background] In large open-source software projects, development knowledge is often fragmented across multiple artefacts and contributors such that individual stakeholders are generally unaware of the full breadth of the product features. However, users want to know what the software is capable of, while contributors need to know where to fix, update, and add features. [Objective] This work aims at understanding how feature knowledge is documented in GitHub projects and how it is linked (if at all) to the source code. [Method] We conducted an in-depth qualitative exploratory content analysis of 25 popular GitHub repositories that provided the documentation artefacts recommended by GitHub’s Community Standards indicator. We first extracted strategies used to document software features in textual artefacts and then strategies used to link the feature documentation with source code. [Results] We observed feature documentation in all studied projects in artefacts such as READMEs, wikis, and website resource files. However, the features were often described in an unstructured way. Additionally, tracing techniques to connect feature documentation and source code were rarely used. [Conclusions] Our results suggest a lacking (or a low-prioritised) feature documentation in open-source projects, little use of normalised structures, and a rare explicit referencing to source code. As a result, product feature traceability is likely to be very limited, and maintainability to suffer over time.","PeriodicalId":300084,"journal":{"name":"2022 IEEE International Conference on Software Maintenance and Evolution (ICSME)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114669175","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-07-30 | DOI: 10.1109/ICSME55016.2022.00042
Title: Adding Context to Source Code Representations for Deep Learning
Fuwei Tian, Christoph Treude
Deep learning models have been successfully applied to a variety of software engineering tasks, such as code classification, summarisation, and bug and vulnerability detection. In order to apply deep learning to these tasks, source code needs to be represented in a format that is suitable for input into the deep learning model. Most approaches to representing source code, such as tokens, abstract syntax trees (ASTs), data flow graphs (DFGs), and control flow graphs (CFGs), focus only on the code itself and do not take into account additional context that could be useful for deep learning models. In this paper, we argue that it is beneficial for deep learning models to have access to additional contextual information about the code being analysed. We present preliminary evidence that encoding context from the call hierarchy along with information from the code itself can improve the performance of a state-of-the-art deep learning model for two software engineering tasks. We outline our research agenda for adding further contextual information to source code representations for deep learning.
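One way to read "encoding context from the call hierarchy along with the code itself" is sketched below: a function's source is paired with the names of the functions it calls before being fed to a model. The extraction with Python's ast module and the concatenation format are our assumptions, not the paper's pipeline.

```python
# Illustrative sketch (our assumption, not the paper's pipeline): pair a function's source
# with the names of the functions it calls, so a model sees code plus call-hierarchy context.
import ast
import inspect

def function_with_call_context(func):
    source = inspect.getsource(func)
    tree = ast.parse(source)
    callees = sorted({node.func.id for node in ast.walk(tree)
                      if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)})
    # One possible input format: the raw code followed by a marker listing its callees.
    return source + "# CALLS: " + ", ".join(callees)

def helper(x):
    return x * 2

def compute(values):
    return [helper(v) for v in values]

print(function_with_call_context(compute))
```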
{"title":"Adding Context to Source Code Representations for Deep Learning","authors":"Fuwei Tian, Christoph Treude","doi":"10.1109/ICSME55016.2022.00042","DOIUrl":"https://doi.org/10.1109/ICSME55016.2022.00042","url":null,"abstract":"Deep learning models have been successfully applied to a variety of software engineering tasks, such as code classification, summarisation, and bug and vulnerability detection. In order to apply deep learning to these tasks, source code needs to be represented in a format that is suitable for input into the deep learning model. Most approaches to representing source code, such as tokens, abstract syntax trees (ASTs), data flow graphs (DFGs), and control flow graphs (CFGs) only focus on the code itself and do not take into account additional context that could be useful for deep learning models. In this paper, we argue that it is beneficial for deep learning models to have access to additional contextual information about the code being analysed. We present preliminary evidence that encoding context from the call hierarchy along with information from the code itself can improve the performance of a state-of-the-art deep learning model for two software engineering tasks. We outline our research agenda for adding further contextual information to source code representations for deep learning.","PeriodicalId":300084,"journal":{"name":"2022 IEEE International Conference on Software Maintenance and Evolution (ICSME)","volume":"237 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134074233","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-07-30 | DOI: 10.1109/ICSME55016.2022.00045
Title: Developers Struggle with Authentication in Blazor WebAssembly
Pascal André, Quentin Stiévenart, Mohammad Ghafari
WebAssembly is a growing technology to build cross-platform applications. We aim to understand the security issues that developers encounter when adopting WebAssembly. We mined WebAssembly questions on Stack Overflow and identified 359 security-related posts. We classified these posts into 8 themes, reflecting developer intentions, and 19 topics, representing developer issues in this domain. We found that the most prevalent themes are related to bug fix support, requests for how to implement particular features, clarification questions, and setup or configuration issues. We noted that the topmost issues relate to authentication in Blazor WebAssembly. We discuss six of them and provide our suggestions to address these issues in practice.
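The mining step can be approximated with a query against the Stack Exchange API, as in the hedged sketch below; the exact tags, keywords, and filtering the authors used are not specified in the abstract, so this is only an illustration.

```python
# Illustrative sketch (not the authors' exact mining pipeline): fetch Stack Overflow
# questions tagged "webassembly" that mention security, via the Stack Exchange API.
import requests

response = requests.get(
    "https://api.stackexchange.com/2.3/search/advanced",
    params={
        "site": "stackoverflow",
        "tagged": "webassembly",
        "q": "security",  # free-text keyword; the study's keyword set may differ
        "pagesize": 20,
    },
    timeout=30,
)
for item in response.json().get("items", []):
    print(item["title"], item["link"])
```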
{"title":"Developers Struggle with Authentication in Blazor WebAssembly","authors":"Pascal André, Quentin Sti'evenart, Mohammad Ghafari","doi":"10.1109/ICSME55016.2022.00045","DOIUrl":"https://doi.org/10.1109/ICSME55016.2022.00045","url":null,"abstract":"WebAssembly is a growing technology to build cross-platform applications. We aim to understand the security issues that developers encounter when adopting WebAssembly. We mined WebAssembly questions on Stack Overflow and identified 359 security-related posts. We classified these posts into 8 themes, reflecting developer intentions, and 19 topics, representing developer issues in this domain. We found that the most prevalent themes are related to bug fix support, requests for how to implement particular features, clarification questions, and setup or configuration issues. We noted that the topmost issues attribute to authentication in Blazor WebAssembly. We discuss six of them and provide our suggestions to clear these issues in practice.","PeriodicalId":300084,"journal":{"name":"2022 IEEE International Conference on Software Maintenance and Evolution (ICSME)","volume":"89 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124600285","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-07-26 | DOI: 10.1109/ICSME55016.2022.00067
Title: Perun: Performance Version System
Tomáš Fiedor, Jiří Pavela, Adam Rogalewicz, Tomáš Vojnar
In this paper, we present PERUN: an open-source tool suite for profiling-based performance analysis. At its core, PERUN maintains links between project versions and the corresponding stored performance profiles, which are then leveraged for automated detection of performance changes in new project versions. The PERUN tool suite further includes multiple profilers (and is designed such that further profilers can be easily added), a performance fuzz-tester for workload generation, methods for deriving performance models, and numerous visualization methods. We demonstrate how PERUN can help developers analyze their program performance on two examples: detection and localization of a performance degradation, and generation of inputs that force performance issues to show up.
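The core idea of linking versions to stored profiles and detecting changes can be sketched without PERUN's actual profile format, which we do not reproduce here; the comparison below over hypothetical per-function runtimes is purely illustrative.

```python
# Illustrative sketch (not PERUN's actual profile format or API): compare per-function
# runtimes of two project versions and flag likely performance degradations.
def detect_degradations(baseline, candidate, threshold=1.25):
    """Return functions whose runtime grew by more than `threshold` times between versions."""
    degraded = {}
    for function, base_time in baseline.items():
        new_time = candidate.get(function)
        if new_time is not None and new_time > base_time * threshold:
            degraded[function] = (base_time, new_time)
    return degraded

# Hypothetical profiles keyed by function name, values in seconds.
v1_profile = {"parse_input": 0.40, "solve": 1.20, "render": 0.10}
v2_profile = {"parse_input": 0.41, "solve": 2.90, "render": 0.11}
print(detect_degradations(v1_profile, v2_profile))  # {'solve': (1.2, 2.9)}
```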
{"title":"Perun: Performance Version System","authors":"Tomás Fiedor, Jirí Pavela, Adam Rogalewicz, Tomáš Vojnar","doi":"10.1109/ICSME55016.2022.00067","DOIUrl":"https://doi.org/10.1109/ICSME55016.2022.00067","url":null,"abstract":"In this paper, we present PERUN: an open-source tool suite for profiling-based performance analysis. At its core, PERUN maintains links between project versions and the corresponding stored performance profiles, which are then leveraged for automated detection of performance changes in new project versions. The PERUN tool suite further includes multiple profilers (and is designed such that further profilers can be easily added), a performance fuzz-tester for workload generation, methods for deriving performance models, and numerous visualization methods. We demonstrate how PERUN can help developers to analyze their program performance on two examples: detection and localization of a performance degradation and generation of inputs forcing performance issues to show up.","PeriodicalId":300084,"journal":{"name":"2022 IEEE International Conference on Software Maintenance and Evolution (ICSME)","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130812606","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-07-20 | DOI: 10.1109/ICSME55016.2022.00039
Title: What Made This Test Flake? Pinpointing Classes Responsible for Test Flakiness
Sarra Habchi, Guillaume Haben, Jeongju Sohn, Adriano Franci, Mike Papadakis, Maxime Cordy, Yves Le Traon
Flaky tests are defined as tests that manifest non-deterministic behaviour by passing and failing intermittently for the same version of the code. These tests cripple continuous integration with false alerts that waste developers’ time and break their trust in regression testing. To mitigate the effects of flakiness, both researchers and industrial experts have proposed strategies and tools to detect and isolate flaky tests. However, flaky tests are rarely fixed, as developers struggle to localise and understand their causes. Additionally, developers working with large codebases often need to know the sources of non-determinism to preserve code quality, i.e., avoid introducing technical debt linked with non-deterministic behaviour, and to avoid introducing new flaky tests. To aid with these tasks, we propose re-targeting Fault Localisation techniques to the flaky component localisation problem, i.e., pinpointing program classes that cause the non-deterministic behaviour of flaky tests. In particular, we employ Spectrum-Based Fault Localisation (SBFL), a coverage-based fault localisation technique commonly adopted for its simplicity and effectiveness. We also utilise other data sources, such as change history and static code metrics, to further improve the localisation. Our results show that augmenting SBFL with change and code metrics ranks the flaky class among the top-1 and top-5 suggestions in 26% and 47% of the cases, respectively. Overall, we successfully reduced the average number of classes inspected to locate the first flaky class to 19% of the total number of classes covered by flaky tests. Our results also show that localisation methods are effective for major flakiness categories, such as concurrency and asynchronous waits, indicating their general ability to identify flaky components.
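For readers unfamiliar with SBFL, it scores program components by how strongly their coverage correlates with failing runs. The sketch below uses the Ochiai formula, a common SBFL metric, as an assumed example; the abstract does not state which formula the study employs, and the coverage data is hypothetical.

```python
# Illustrative SBFL sketch using the Ochiai formula (an assumption; the abstract does not
# name the formula). Classes are ranked by how strongly coverage correlates with failures.
from math import sqrt

def ochiai_ranking(runs):
    """runs: list of (covered_classes, passed) pairs for repeated executions of a flaky test."""
    total_failed = sum(1 for _, passed in runs if not passed)
    classes = {cls for covered, _ in runs for cls in covered}
    scores = {}
    for cls in classes:
        e_f = sum(1 for covered, passed in runs if not passed and cls in covered)
        e_p = sum(1 for covered, passed in runs if passed and cls in covered)
        denominator = sqrt(total_failed * (e_f + e_p))
        scores[cls] = e_f / denominator if denominator else 0.0
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical coverage of three classes over four runs of the same flaky test.
runs = [({"Cache", "Scheduler"}, True),
        ({"Cache", "Scheduler", "AsyncPool"}, False),
        ({"Cache"}, True),
        ({"Scheduler", "AsyncPool"}, False)]
print(ochiai_ranking(runs))  # AsyncPool ranks first: it is covered only in failing runs
```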
{"title":"What Made This Test Flake? Pinpointing Classes Responsible for Test Flakiness","authors":"Sarra Habchi, Guillaume Haben, Jeongju Sohn, Adriano Franci, Mike Papadakis, Maxime Cordy, Yves Le Traon","doi":"10.1109/ICSME55016.2022.00039","DOIUrl":"https://doi.org/10.1109/ICSME55016.2022.00039","url":null,"abstract":"Flaky tests are defined as tests that manifest non-deterministic behaviour by passing and failing intermittently for the same version of the code. These tests cripple continuous integration with false alerts that waste developers’ time and break their trust in regression testing. To mitigate the effects of flakiness, both researchers and industrial experts proposed strategies and tools to detect and isolate flaky tests. However, flaky tests are rarely fixed as developers struggle to localise and understand their causes. Additionally, developers working with large codebases often need to know the sources of non-determinism to preserve code quality, i.e., avoid introducing technical debt linked with non-deterministic behaviour, and to avoid introducing new flaky tests. To aid with these tasks, we propose re-targeting Fault Localisation techniques to the flaky component localisation problem, i.e., pinpointing program classes that cause the non-deterministic behaviour of flaky tests. In particular, we employ Spectrum-Based Fault Localisation (SBFL), a coverage-based fault localisation technique commonly adopted for its simplicity and effectiveness. We also utilise other data sources, such as change history and static code metrics, to further improve the localisation. Our results show that augmenting SBFL with change and code metrics ranks flaky classes in the top-1 and top-5 suggestions, in 26% and 47% of the cases. Overall, we successfully reduced the average number of classes inspected to locate the first flaky class to 19% of the total number of classes covered by flaky tests. Our results also show that localisation methods are effective in major flakiness categories, such as concurrency and asynchronous waits, indicating their general ability to identify flaky components.","PeriodicalId":300084,"journal":{"name":"2022 IEEE International Conference on Software Maintenance and Evolution (ICSME)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130975117","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-07-03 | DOI: 10.1109/ICSME55016.2022.00011
Title: An Empirical Study of Flaky Tests in JavaScript
Negar Hashemi, Amjed Tahir, Shawn Rasheed
Flaky tests (tests with non-deterministic outcomes) can be problematic for testing efficiency and software reliability. Flaky tests in test suites can also significantly delay software releases. There have been several studies that attempt to quantify the impact of test flakiness in different programming languages (e.g., Java and Python) and application domains (e.g., mobile and GUI-based). In this paper, we conduct an empirical study of the state of flaky tests in JavaScript. We investigate two aspects of flaky tests in JavaScript projects: the main causes of flaky tests in these projects and common fixing strategies. By analysing 452 commits from large, top-scoring JavaScript projects on GitHub, we found that flakiness caused by concurrency-related issues (e.g., async wait, race conditions, or deadlocks) is the most dominant reason for test flakiness. The other top causes of flaky tests are operating system-specific issues (e.g., features that work only on specific operating systems or OS versions) and network stability (e.g., internet availability or bad socket management). In terms of how flaky tests are dealt with, the majority of flaky tests (>80%) are fixed to eliminate the flaky behaviour, while developers sometimes skip, quarantine, or remove them.
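A rough picture of how fix commits might be bucketed into the cause categories named above is sketched below; the keyword lists and the categorisation rule are illustrative assumptions, not the authors' coding scheme.

```python
# Illustrative sketch (our assumption, not the study's coding scheme): bucket commit
# messages that fix flaky tests into coarse cause categories via keyword matching.
CAUSE_KEYWORDS = {
    "concurrency": ["async", "await", "race", "deadlock", "timeout"],
    "operating system": ["windows", "macos", "linux", "os version"],
    "network": ["network", "socket", "connection", "offline"],
}

def categorise(commit_message):
    message = commit_message.lower()
    matched = [cause for cause, keywords in CAUSE_KEYWORDS.items()
               if any(keyword in message for keyword in keywords)]
    return matched or ["other"]

print(categorise("Fix flaky test: await the upload promise instead of a fixed timeout"))
# ['concurrency']
```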
{"title":"An Empirical Study of Flaky Tests in JavaScript","authors":"Negar Hashemi, Amjed Tahir, Shawn Rasheed","doi":"10.1109/ICSME55016.2022.00011","DOIUrl":"https://doi.org/10.1109/ICSME55016.2022.00011","url":null,"abstract":"Flaky tests (tests with non-deterministic outcomes) can be problematic for testing efficiency and software reliability. Flaky tests in test suites can also significantly delay software releases. There have been several studies that attempt to quantify the impact of test flakiness in different programming languages (e.g., Java and Python) and application domains (e.g., mobile and GUI-based). In this paper, we conduct an empirical study of the state of flaky tests in JavaScript. We investigate two aspects of flaky tests in JavaScript projects: the main causes of flaky tests in these projects and common fixing strategies. By analysing 452 commits from large, top-scoring JavaScript projects from GitHub, we found that flakiness caused by concurrency-related issues (e.g., async wait, race conditions or deadlocks) is the most dominant reason for test flakiness. The other top causes of flaky tests are operating system-specific (e.g., features that work on specific OS or OS versions) and network stability (e.g., internet availability or bad socket management). In terms of how flaky tests are dealt with, the majority of those flaky tests (>80%) are fixed to eliminate flaky behaviour and developers sometimes skip, quarantine or remove flaky tests.","PeriodicalId":300084,"journal":{"name":"2022 IEEE International Conference on Software Maintenance and Evolution (ICSME)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129835506","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}