IRJIT: A simple, online, information retrieval approach for just-in-time software defect prediction
Pub Date: 2024-08-02 | DOI: 10.1007/s10664-024-10514-z
Hareem Sahar, Abdul Ali Bangash, Abram Hindle, Denilson Barbosa
Just-in-Time software defect prediction (JIT-SDP) prevents the introduction of defects into software by identifying them at commit check-in time. Current software defect prediction approaches rely on manually crafted features such as change metrics and on machine learning or deep learning models that are expensive to train. These models typically require extensive training, significant computational resources, and time, which makes it hard to update them in real time as new examples arrive and limits their suitability for fast online defect prediction. Furthermore, the reliance on a complex underlying model often makes these approaches less explainable: developers cannot understand the reasons behind a model's predictions. An approach that is not explainable may not be adopted in real-life development environments because developers do not trust its results. To address these limitations, we propose IRJIT, an approach that applies information retrieval to source code and labels new commits as buggy or clean based on their similarity to past buggy or clean commits. IRJIT is online and explainable: it can learn from new data without expensive retraining, and developers can see the documents that support a prediction, providing additional context. Evaluating on 10 open-source datasets in a within-project setting, we show that our approach is up to 112 times faster than state-of-the-art ML and DL approaches, offers explainability at the commit and line level, and achieves performance comparable to the state-of-the-art.
{"title":"IRJIT: A simple, online, information retrieval approach for just-in-time software defect prediction","authors":"Hareem Sahar, Abdul Ali Bangash, Abram Hindle, Denilson Barbosa","doi":"10.1007/s10664-024-10514-z","DOIUrl":"https://doi.org/10.1007/s10664-024-10514-z","url":null,"abstract":"<p>Just-in-Time software defect prediction (JIT-SDP) prevents the introduction of defects into the software by identifying them at commit check-in time. Current software defect prediction approaches rely on manually crafted features such as change metrics and involve expensive to train machine learning or deep learning models. These models typically involve extensive training processes that may require significant computational resources and time. These characteristics can pose challenges when attempting to update the models in real-time as new examples become available, potentially impacting their suitability for fast online defect prediction. Furthermore, the reliance on a complex underlying model makes these approaches often less <i>explainable</i>, which means the developers cannot understand the reasons behind models’ predictions. An approach that is not <i>explainable</i> might not be adopted in real-life development environments because of developers’ lack of trust in its results. To address these limitations, we propose an approach called IRJIT that employs information retrieval on source code and labels new commits as buggy or clean based on their similarity to past buggy or clean commits. IRJIT approach is <i>online</i> and <i>explainable</i> as it can learn from new data without expensive retraining, and developers can see the documents that support a prediction, providing additional context. By evaluating 10 open-source datasets in a within project setting, we show that our approach is up to 112 times faster than the state-of-the-art ML and DL approaches, offers explainability at the commit and line level, and has comparable performance to the state-of-the-art.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"182 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-08-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141881760","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Industrial adoption of machine learning techniques for early identification of invalid bug reports
Pub Date: 2024-07-31 | DOI: 10.1007/s10664-024-10502-3
Muhammad Laiq, Nauman bin Ali, Jürgen Börstler, Emelie Engström
Despite the accuracy of machine learning (ML) techniques in predicting invalid bug reports, as shown in earlier research, and the importance of identifying invalid bug reports early in software maintenance, the adoption of ML techniques for this task in industrial practice has yet to be investigated. In this study, we used a technology transfer model to guide the adoption of an ML technique at a company for the early identification of invalid bug reports. In the process, we also identified necessary conditions for adopting such techniques in practice. We followed a case study research approach with various design and analysis iterations for the technology transfer activities. We collected data from bug repositories as well as through focus groups, a questionnaire, and a presentation and feedback session with an expert. As expected, we found that an ML technique can identify invalid bug reports with acceptable accuracy at an early stage. However, the technique's accuracy drops over time in operational use due to changes in the product, the technologies used, or the development organization. Such changes may require retraining the ML model. During validation, practitioners highlighted the need to understand the ML technique's predictions in order to trust them. We found that a visual explanation of a prediction (using a state-of-the-art ML interpretation framework), together with a descriptive one, increases the trustability of the technique compared to presenting only the validity predictions. We conclude that trustability, integration with the existing toolchain, and maintaining the technique's accuracy over time are critical for increasing the likelihood of adoption.
{"title":"Industrial adoption of machine learning techniques for early identification of invalid bug reports","authors":"Muhammad Laiq, Nauman bin Ali, Jürgen Börstler, Emelie Engström","doi":"10.1007/s10664-024-10502-3","DOIUrl":"https://doi.org/10.1007/s10664-024-10502-3","url":null,"abstract":"<p>Despite the accuracy of machine learning (ML) techniques in predicting invalid bug reports, as shown in earlier research, and the importance of early identification of invalid bug reports in software maintenance, the adoption of ML techniques for this task in industrial practice is yet to be investigated. In this study, we used a technology transfer model to guide the adoption of an ML technique at a company for the early identification of invalid bug reports. In the process, we also identify necessary conditions for adopting such techniques in practice. We followed a case study research approach with various design and analysis iterations for technology transfer activities. We collected data from bug repositories, through focus groups, a questionnaire, and a presentation and feedback session with an expert. As expected, we found that an ML technique can identify invalid bug reports with acceptable accuracy at an early stage. However, the technique’s accuracy drops over time in its operational use due to changes in the product, the used technologies, or the development organization. Such changes may require retraining the ML model. During validation, practitioners highlighted the need to understand the ML technique’s predictions to trust the predictions. We found that a visual (using a state-of-the-art ML interpretation framework) and descriptive explanation of the prediction increases the trustability of the technique compared to just presenting the results of the validity predictions. We conclude that trustability, integration with the existing toolchain, and maintaining the techniques’ accuracy over time are critical for increasing the likelihood of adoption.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"42 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141873422","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dependabot and security pull requests: large empirical study
Pub Date: 2024-07-30 | DOI: 10.1007/s10664-024-10523-y
Hocine Rebatchi, Tégawendé F. Bissyandé, Naouel Moha
<p>Modern software development is a complex engineering process where developer code cohabits with an increasingly larger number of external open-source components. Even though these components facilitate sharing and reusing code along with other benefits related to maintenance and code quality, they are often the seeds of vulnerabilities in the software supply chain leading to attacks with severe consequences. Indeed, one common strategy used to conduct attacks is to exploit or inject other security flaws in new versions of dependency packages. It is thus important to keep dependencies updated in a software development project. Unfortunately, several prior studies have highlighted that, to a large extent, developers struggle to keep track of the dependency package updates, and do not quickly incorporate security patches. Therefore, automated dependency-update bots have been proposed to mitigate the impact and the emergence of vulnerabilities in open-source projects. In our study, we focus on Dependabot, a dependency management bot that has gained popularity on GitHub recently. It allows developers to keep a lookout on project dependencies and reduce the effort of monitoring the safety of the software supply chain. We performed a large empirical study on dependency updates and security pull requests to understand: (1) the degree and reasons of Dependabot’s popularity; (2) the patterns of developers’ practices and techniques to deal with vulnerabilities in dependencies; (3) the management of security pull requests (PRs), the threat lifetime, and the fix delay; and (4) the factors that significantly correlate with the acceptance of security PRs and fast merges. To that end, we collected a dataset of 9,916,318 pull request-related issues made in 1,743,035 projects on GitHub for more than 10 different programming languages. In addition to the comprehensive quantitative analysis, we performed a manual qualitative analysis on a representative sample of the dataset, and we substantiated our findings by sending a survey to developers that use dependency management tools. Our study shows that Dependabot dominates more than 65% of dependency management activity, mainly due to its efficiency, accessibility, adaptivity, and availability of support. We also found that developers handle dependency vulnerabilities differently, but mainly rely on the automation of PRs generation to upgrade vulnerable dependencies. Interestingly, Dependabot’s and developers’ security PRs are highly accepted, and the automation allows to accelerate their management, so that fixes are applied in less than one day. However, the threat of dependency vulnerabilities remains hidden for 512 days on average, and patches are disclosed after 362 days due to the reliance on the manual effort of security experts. Also, project characteristics, the amount of PR changes, as well as developer and dependency features seem to be highly correlated with the acceptance and fast merges of security PR
{"title":"Dependabot and security pull requests: large empirical study","authors":"Hocine Rebatchi, Tégawendé F. Bissyandé, Naouel Moha","doi":"10.1007/s10664-024-10523-y","DOIUrl":"https://doi.org/10.1007/s10664-024-10523-y","url":null,"abstract":"<p>Modern software development is a complex engineering process where developer code cohabits with an increasingly larger number of external open-source components. Even though these components facilitate sharing and reusing code along with other benefits related to maintenance and code quality, they are often the seeds of vulnerabilities in the software supply chain leading to attacks with severe consequences. Indeed, one common strategy used to conduct attacks is to exploit or inject other security flaws in new versions of dependency packages. It is thus important to keep dependencies updated in a software development project. Unfortunately, several prior studies have highlighted that, to a large extent, developers struggle to keep track of the dependency package updates, and do not quickly incorporate security patches. Therefore, automated dependency-update bots have been proposed to mitigate the impact and the emergence of vulnerabilities in open-source projects. In our study, we focus on Dependabot, a dependency management bot that has gained popularity on GitHub recently. It allows developers to keep a lookout on project dependencies and reduce the effort of monitoring the safety of the software supply chain. We performed a large empirical study on dependency updates and security pull requests to understand: (1) the degree and reasons of Dependabot’s popularity; (2) the patterns of developers’ practices and techniques to deal with vulnerabilities in dependencies; (3) the management of security pull requests (PRs), the threat lifetime, and the fix delay; and (4) the factors that significantly correlate with the acceptance of security PRs and fast merges. To that end, we collected a dataset of 9,916,318 pull request-related issues made in 1,743,035 projects on GitHub for more than 10 different programming languages. In addition to the comprehensive quantitative analysis, we performed a manual qualitative analysis on a representative sample of the dataset, and we substantiated our findings by sending a survey to developers that use dependency management tools. Our study shows that Dependabot dominates more than 65% of dependency management activity, mainly due to its efficiency, accessibility, adaptivity, and availability of support. We also found that developers handle dependency vulnerabilities differently, but mainly rely on the automation of PRs generation to upgrade vulnerable dependencies. Interestingly, Dependabot’s and developers’ security PRs are highly accepted, and the automation allows to accelerate their management, so that fixes are applied in less than one day. However, the threat of dependency vulnerabilities remains hidden for 512 days on average, and patches are disclosed after 362 days due to the reliance on the manual effort of security experts. 
Also, project characteristics, the amount of PR changes, as well as developer and dependency features seem to be highly correlated with the acceptance and fast merges of security PR","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"296 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141872783","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
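Mining Dependabot activity at this scale starts with queries like the one sketched below, which counts pull requests authored by Dependabot in a repository via GitHub's search API. The `author:app/dependabot` qualifier and the placeholder repository are assumptions about one plausible way to collect such data, not the study's actual pipeline.

```python
# Count PRs opened by Dependabot in a repository via GitHub's search API.
import requests

def dependabot_prs(repo, token=None):
    """Return the total count of PRs authored by Dependabot in `repo`."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:  # unauthenticated requests are heavily rate-limited
        headers["Authorization"] = f"Bearer {token}"
    query = f"repo:{repo} is:pr author:app/dependabot"
    resp = requests.get(
        "https://api.github.com/search/issues",
        params={"q": query, "per_page": 1},
        headers=headers,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["total_count"]

print(dependabot_prs("octocat/Hello-World"))  # placeholder repository
```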
DDImage: an image reduction based approach for automatically explaining black-box classifiers
Mingyue Jiang, Chengjian Tang, Xiao-Yi Zhang, Yangyang Zhao, Zuohua Ding
Pub Date: 2024-07-30 | DOI: 10.1007/s10664-024-10505-0
Due to the prevalent application of machine learning (ML) techniques and the intrinsic black-box nature of ML models, the need for good explanations, ones that are both sufficient and necessary for locally interpreting a model's prediction, has been well recognized and emphasized. Existing explanation approaches, however, favor either sufficiency or necessity. To fill this gap, we propose an approach for generating local explanations that are both sufficient and necessary. Our approach, DDImage, automatically produces local explanations for ML-based image classifiers in a post-hoc way. The core idea behind DDImage is to discover an appropriate explanation by debugging the given input image via a series of image reductions, with respect to the sufficiency and necessity properties. An evaluation of DDImage using publicly available datasets and popular classification models shows its effectiveness and efficiency. Compared with three state-of-the-art approaches, DDImage demonstrates superior performance in producing small explanations that preserve both sufficiency and necessity, and it also shows promising stability and efficiency. We also identify the impact of segmentation granularity, reveal the performance variance across different target models, and further show that our approach is applicable across different problem domains.
{"title":"DDImage: an image reduction based approach for automatically explaining black-box classifiers","authors":"Mingyue Jiang, Chengjian Tang, Xiao-Yi Zhang, Yangyang Zhao, Zuohua Ding","doi":"10.1007/s10664-024-10505-0","DOIUrl":"https://doi.org/10.1007/s10664-024-10505-0","url":null,"abstract":"<p>Due to the prevalent application of machine learning (ML) techniques and the intrinsic black-box nature of ML models, the need for good explanations that are sufficient and necessary towards locally interpreting a model’s prediction has been well recognized and emphasized. Existing explanation approaches, however, favor either the sufficiency or necessity. To fill this gap, in this paper, we propose an approach for generating local explanations that are both sufficient and necessary. Our approach, DDImage, automatically produces local explanations for ML-based image classifiers in a post-hoc way. The core idea behind DDImage is to discover an appropriate explanation by debugging the given input image via a series of image reductions, with respect to the sufficiency and necessity properties. Evaluation of DDImage using publicly available datasets and popular classification models reveals its effectiveness and efficiency. Compared with three state-of-the-art approaches, DDImage demonstrates a superior performance in producing small-sized explanations preserving both sufficiency and necessity, and it also shows promising stability and efficiency. We also identify the impact of segmentation granularity, reveal the performance variance for different target models, and further show that our approach is applicable across different problem domains.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"214 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141872636","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Adopting automated bug assignment in practice — a longitudinal case study at Ericsson
Pub Date: 2024-07-30 | DOI: 10.1007/s10664-024-10507-y
Markus Borg, Leif Jonsson, Emelie Engström, Béla Bartalos, Attila Szabó
[Context] The continuous inflow of bug reports is a considerable challenge in large development projects. Inspired by contemporary work on mining software repositories, we designed a prototype bug assignment solution based on machine learning in 2011-2016. The prototype evolved into an internal Ericsson product, TRR, in 2017-2018. TRR's first bug assignment without human intervention happened in April 2019. [Objective] Our study evaluates the adoption of TRR within its industrial context at Ericsson, i.e., we provide lessons learned related to the productization of a research prototype within a company. Moreover, we investigate 1) how TRR performs in the field, 2) what value TRR provides to Ericsson, and 3) how TRR has influenced the ways of working. [Method] We conducted a preregistered industrial case study combining interviews with TRR stakeholders, minutes from sprint planning meetings, and bug-tracking data. The data analysis includes thematic analysis, descriptive statistics, and Bayesian causal analysis. [Results] TRR is now an incorporated part of the bug assignment process. Considering the abstraction levels of the telecommunications stack, high-level modules are more positive, while low-level modules experienced some drawbacks. Most importantly, some bug reports directly reach low-level modules without first having passed through fundamental root-cause analysis steps at higher levels. On average, TRR automatically assigns 30% of the incoming bug reports with an accuracy of 75%. Auto-routed TRs are resolved around 21% faster within Ericsson, and TRR has saved highly seasoned engineers many hours of work. Indirect effects of adopting TRR include process improvements, process awareness, increased communication, and higher job satisfaction. [Conclusions] TRR has saved time at Ericsson, but the adoption of automated bug assignment was more intricate than in similar endeavors reported by other companies. We primarily attribute the difference to the very large size of the organization and the complexity of its products. Key facilitators of the successful adoption include a gradual introduction, product champions, and careful stakeholder analysis.
{"title":"Adopting automated bug assignment in practice — a longitudinal case study at Ericsson","authors":"Markus Borg, Leif Jonsson, Emelie Engström, Béla Bartalos, Attila Szabó","doi":"10.1007/s10664-024-10507-y","DOIUrl":"https://doi.org/10.1007/s10664-024-10507-y","url":null,"abstract":"<p>[Context] The continuous inflow of bug reports is a considerable challenge in large development projects. Inspired by contemporary work on mining software repositories, we designed a prototype bug assignment solution based on machine learning in 2011-2016. The prototype evolved into an internal Ericsson product, TRR, in 2017-2018. TRR’s first bug assignment without human intervention happened in April 2019. [Objective] Our study evaluates the adoption of TRR within its industrial context at Ericsson, i.e., we provide lessons learned related to the productization of a research prototype within a company. Moreover, we investigate 1) how TRR performs in the field, 2) what value TRR provides to Ericsson, and 3) how TRR has influenced the ways of working. [Method] We conduct a preregistered industrial case study combining interviews with TRR stakeholders, minutes from sprint planning meetings, and bug-tracking data. The data analysis includes thematic analysis, descriptive statistics, and Bayesian causal analysis. [Results] TRR is now an incorporated part of the bug assignment process. Considering the abstraction levels of the telecommunications stack, high-level modules are more positive while low-level modules experienced some drawbacks. Most importantly, some bug reports directly reach low-level modules without first having passed through fundamental root-cause analysis steps at higher levels. On average, TRR automatically assigns 30% of the incoming bug reports with an accuracy of 75%. Auto-routed TRs are resolved around 21% faster within Ericsson, and TRR has saved highly seasoned engineers many hours of work. Indirect effects of adopting TRR include process improvements, process awareness, increased communication, and higher job satisfaction. [Conclusions] TRR has saved time at Ericsson, but the adoption of automated bug assignment was more intricate compared to similar endeavors reported from other companies. We primarily attribute the difference to the very large size of the organization and the complex products. Key facilitators in the successful adoption include a gradual introduction, product champions, and careful stakeholder analysis.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"74 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141872637","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Testing the past: can we still run tests in past snapshots for Java projects?
Pub Date: 2024-07-30 | DOI: 10.1007/s10664-024-10530-z
Michel Maes-Bermejo, Micael Gallego, Francisco Gortázar, Gregorio Robles, Jesus M. Gonzalez-Barahona
Building past snapshots of a software project has been shown to be of interest to both researchers and practitioners. However, little attention has been devoted specifically to the tests available in those past snapshots, which are fundamental for maintaining old versions still in production. The aim of this study is to determine to what extent the tests of past snapshots can be executed successfully, which would mean those snapshots are still testable. Given a software project, we build all of its past snapshots from source code, including tests, and then run the tests. When tests do not succeed, we record the reasons, allowing us to determine the factors that make tests fail. We applied this method to 86 Java projects. On average, all tests pass in 52.53% of the project snapshots on which the tests can be built. However, on average, 94.14% of tests pass in previous snapshots when we account for the percentage of tests passing in the snapshots used to build those tests. In real software projects, successfully running tests in past snapshots is not something we can take for granted: in a large proportion of the projects we studied, it does not happen frequently. We found that building from source code is the main limitation when running tests on past snapshots. However, we also found some projects whose tests run successfully in a very large fraction of past snapshots, which allows us to identify good practices. We also provide a framework and metrics to quantify the testability of past snapshots (the extent to which we are able to run a snapshot's tests with a successful result) from several points of view, which simplifies new analyses on this matter and could help measure how any project performs in this respect.
{"title":"Testing the past: can we still run tests in past snapshots for Java projects?","authors":"Michel Maes-Bermejo, Micael Gallego, Francisco Gortázar, Gregorio Robles, Jesus M. Gonzalez-Barahona","doi":"10.1007/s10664-024-10530-z","DOIUrl":"https://doi.org/10.1007/s10664-024-10530-z","url":null,"abstract":"<p>Building past snapshots of a software project has shown to be of interest both for researchers and practitioners. However, little attention has been devoted specifically to tests available in those past snapshots, which are fundamental for the maintenance of old versions still in production. The aim of this study is to determine to which extent tests of past snapshots can be executed successfully, which would mean these past snapshots are still testable. Given a software project, we build all its past snapshots from source code, including tests, and then run the tests. When tests do not result in success, we also record the reasons, allowing us to determine factors that make tests fail. We run this method on a total of 86 Java projects. On average, for 52.53% of the project snapshots on which tests can be built, all tests pass. However, on average, 94.14% of tests pass in previous snapshots when we take into account the percentage of tests passing in the snapshots used for building those tests. In real software projects, successfully running tests in past snapshots is not something that we can take for granted: we have found that in a large proportion of the projects we studied this does not happen frequently. We have found that the building from source code is the main limitation when running tests on past snapshots. However, we have found some projects for which tests run successfully in a very large fraction of past snapshots, which allows us to identify good practices. We also provide a framework and metrics to quantify testability (the extent to which we are able to run tests of a snapshot with a success result) of past snapshots from several points of view, which simplifies new analyses on this matter, and could help to measure how any project performs in this respect.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"78 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141872638","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Investigating the online recruitment and selection journey of novice software engineers: Anti-patterns and recommendations
Pub Date: 2024-07-30 | DOI: 10.1007/s10664-024-10498-w
Miguel Setúbal, Tayana Conte, Marcos Kalinowski, Allysson Allex Araújo
The growing software development market has increased the demand for qualified professionals in Software Engineering (SE). To this end, companies must enhance their Recruitment and Selection (R&S) processes to maintain high-quality teams, including opening opportunities for beginners such as trainees and interns. However, given the various judgments and sociotechnical factors involved, this complex R&S process poses a challenge for recent graduates seeking to enter the market. This paper aims to identify a set of anti-patterns and recommendations concerning R&S processes for early-career SE professionals. Following an exploratory and qualitative methodological approach, we conducted six online focus groups with 18 recruiters experienced in R&S in the software industry. After completing our qualitative analysis, we identified 12 anti-patterns and 31 actionable recommendations regarding the hiring process focused on entry-level SE professionals. The identified anti-patterns encompass behavioral and technical dimensions inherent in R&S processes. These findings provide a rich opportunity for reflection in the SE industry and offer valuable guidance for early-career candidates and organizations. From an academic perspective, this work also raises awareness of the intersection of Human Resources and SE, an area with considerable potential to be expanded in the context of cooperative and human aspects of SE.
{"title":"Investigating the online recruitment and selection journey of novice software engineers: Anti-patterns and recommendations","authors":"Miguel Setúbal, Tayana Conte, Marcos Kalinowski, Allysson Allex Araújo","doi":"10.1007/s10664-024-10498-w","DOIUrl":"https://doi.org/10.1007/s10664-024-10498-w","url":null,"abstract":"<p>The growing software development market has increased the demand for qualified professionals in Software Engineering (SE). To this end, companies must enhance their Recruitment and Selection (R&S) processes to maintain high-quality teams, including opening opportunities for beginners, such as trainees and interns. However, given the various judgments and sociotechnical factors involved, this complex process of R&S poses a challenge for recent graduates seeking to enter the market. This paper aims to identify a set of anti-patterns and recommendations for early career SE professionals concerning R&S processes. Under an exploratory and qualitative methodological approach, we conducted six online Focus Groups with 18 recruiters with experience in R&S in the software industry. After completing our qualitative analysis, we identified 12 anti-patterns and 31 actionable recommendations regarding the hiring process focused on entry-level SE professionals. The identified anti-patterns encompass behavioral and technical dimensions innate to R&S processes. These findings provide a rich opportunity for reflection in the SE industry and offer valuable guidance for early-career candidates and organizations. From an academic perspective, this work also raises awareness of the intersection of Human Resources and SE, an area with considerable potential to be expanded in the context of cooperative and human aspects of SE.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"44 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141872639","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
App review driven collaborative bug finding
Pub Date: 2024-07-26 | DOI: 10.1007/s10664-024-10489-x
Xunzhu Tang, Haoye Tian, Pingfan Kong, Saad Ezzini, Kui Liu, Xin Xia, Jacques Klein, Tegawendé F. Bissyandé
Software development teams generally welcome any effort to expose bugs in their code base. In this work, we build on the hypothesis that mobile apps from the same category (e.g., two web browser apps) may be affected by similar bugs during their evolution. It is therefore possible to transfer the experience of one historical app to quickly find bugs in its new counterparts, which has been referred to as collaborative bug finding in the literature. Our novelty is that we guide the bug-finding process by considering that existing bugs have been hinted at in app reviews. Concretely, we design the BugRMSys approach to recommend bug reports for a target app by matching historical bug reports from apps in the same category with user reviews of the target app. We show experimentally that this approach enables us to quickly expose and report dozens of bugs for targeted apps such as Brave (a web browser app). BugRMSys's implementation relies on DistilBERT to produce natural-language text embeddings. Our pipeline considers similarities between bug reports and app reviews to identify relevant bugs. We then use the app review, together with potential reproduction steps in the historical bug report (from a same-category app), to reproduce the bugs. Overall, after applying BugRMSys to six popular apps, we were able to identify, reproduce, and report 20 new bugs: among these, 9 reports have already been triaged, 6 were confirmed, and 4 have been fixed by official development teams.
{"title":"App review driven collaborative bug finding","authors":"Xunzhu Tang, Haoye Tian, Pingfan Kong, Saad Ezzini, Kui Liu, Xin Xia, Jacques Klein, Tegawendé F. Bissyandé","doi":"10.1007/s10664-024-10489-x","DOIUrl":"https://doi.org/10.1007/s10664-024-10489-x","url":null,"abstract":"<p>Software development teams generally welcome any effort to expose bugs in their code base. In this work, we build on the hypothesis that mobile apps from the same category (e.g., two web browser apps) may be affected by similar bugs in their evolution process. It is therefore possible to transfer the experience of one historical app to quickly find bugs in its new counterparts. This has been referred to as collaborative bug finding in the literature. Our novelty is that we guide the bug finding process by considering that existing bugs have been hinted within app reviews. Concretely, we design the <span>BugRMSys</span> approach to recommend bug reports for a target app by matching historical bug reports from apps in the same category with user app reviews of the target app. We experimentally show that this approach enables us to quickly expose and report dozens of bugs for targeted apps such as Brave (web browser app). <span>BugRMSys</span> ’s implementation relies on DistilBERT to produce natural language text embeddings. Our pipeline considers similarities between bug reports and app reviews to identify relevant bugs. We then focus on the app review as well as potential reproduction steps in the historical bug report (from a same-category app) to reproduce the bugs. Overall, after applying <span>BugRMSys</span> to six popular apps, we were able to identify, reproduce and report 20 new bugs: among these, 9 reports have been already triaged, 6 were confirmed, and 4 have been fixed by official development teams.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"245 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-07-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141776680","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Neuron importance-aware coverage analysis for deep neural network testing
Pub Date: 2024-07-25 | DOI: 10.1007/s10664-024-10524-x
Hongjing Guo, Chuanqi Tao, Zhiqiu Huang
Deep Neural Network (DNN) models are widely used in many cutting-edge domains, such as medical diagnostics and autonomous driving, and the need to test them thoroughly has become increasingly urgent. Recent research proposes various structural and non-structural coverage criteria to measure test adequacy. Structural coverage criteria quantify the degree to which the internal elements of DNN models are covered by a test suite. However, they convey little information about individual inputs and exhibit limited correlation with defect detection. Additionally, existing non-structural coverage criteria are unaware of neurons' importance to decision-making. This paper addresses these limitations by proposing novel non-structural coverage criteria. By tracing neurons' cumulative contribution to the final decision over the training set, the paper identifies the important neurons of a DNN model. A novel metric is proposed to quantify the difference in important-neuron behavior between a test input and the training set, providing a measurement at the granularity of individual test inputs. Additionally, two non-structural coverage criteria are introduced that quantify test adequacy by examining differences in important-neuron behavior between the testing and training sets. An empirical evaluation on image datasets demonstrates that the proposed metric outperforms existing non-structural adequacy metrics, with up to a 14.7% accuracy improvement in capturing error-revealing test inputs. Compared with state-of-the-art coverage criteria, the proposed criteria are more sensitive to errors, including natural errors and adversarial examples.
{"title":"Neuron importance-aware coverage analysis for deep neural network testing","authors":"Hongjing Guo, Chuanqi Tao, Zhiqiu Huang","doi":"10.1007/s10664-024-10524-x","DOIUrl":"https://doi.org/10.1007/s10664-024-10524-x","url":null,"abstract":"<p>Deep Neural Network (DNN) models are widely used in many cutting-edge domains, such as medical diagnostics and autonomous driving. However, an urgent need to test DNN models thoroughly has increasingly risen. Recent research proposes various structural and non-structural coverage criteria to measure test adequacy. Structural coverage criteria quantify the degree to which the internal elements of DNN models are covered by a test suite. However, they convey little information about individual inputs and exhibit limited correlation with defect detection. Additionally, existing non-structural coverage criteria are unaware of neurons’ importance to decision-making. This paper addresses these limitations by proposing novel non-structural coverage criteria. By tracing neurons’ cumulative contribution to the final decision on the training set, this paper identifies important neurons of DNN models. A novel metric is proposed to quantify the difference in important neuron behavior between a test input and the training set, which provides a measured way at individual test input granularity. Additionally, two non-structural coverage criteria are introduced that allow for the quantification of test adequacy by examining differences in important neuron behavior between the testing and the training set. The empirical evaluation of image datasets demonstrates that the proposed metric outperforms the existing non-structural adequacy metrics by up to 14.7% accuracy improvement in capturing error-revealing test inputs. Compared with state-of-the-art coverage criteria, the proposed coverage criteria are more sensitive to errors, including natural errors and adversarial examples.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"90 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141776682","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An empirical study on the potential of word embedding techniques in bug report management tasks
Bingting Chen, Weiqin Zou, Biyu Cai, Qianshuang Meng, Wenjie Liu, Piji Li, Lin Chen
Pub Date: 2024-07-25 | DOI: 10.1007/s10664-024-10510-3

Context

Representing the textual semantics of bug reports is a key component of bug report management (BRM) techniques. Existing studies mainly use classical information-retrieval (IR) approaches, such as the vector space model (VSM), for semantic extraction. Little attention has been paid to whether word embedding (WE) models from natural language processing could help BRM tasks.
Objective
To gain a general view of the potential of word embedding models for representing the semantics of bug reports, and to provide actionable guidelines on using semantic retrieval models for BRM tasks.
Method
We studied the efficacy of five widely recognized WE models for six BRM tasks on 20 widely used products from the Eclipse and Mozilla foundations. Specifically, we first explored which machine learning techniques are suitable when using WE models and which WE model suits BRM tasks best. Then we studied whether WE models perform better than the classical VSM. Last, we investigated whether WE models fine-tuned on bug reports outperform general pre-trained WE models.
Key Results
The Random Forest (RF) classifier outperformed other typical classifiers when different WE models were used for semantic extraction. We rarely observed statistically significant performance differences among the five WE models in five BRM classification tasks, but we found that small-dimensional WE models performed better than larger ones in the duplicate bug report detection task. In the three BRM tasks that showed statistically significant performance differences (bug severity prediction, reopened bug prediction, and duplicate bug report detection), VSM outperformed the studied WE models. We did not find a performance improvement after fine-tuning general pre-trained BERT with bug report data.
Conclusion
Performance improvements from using pre-trained WE models were not observed in the studied BRM tasks. The combination of RF and the traditional VSM achieved the best performance across the various BRM tasks.
{"title":"An empirical study on the potential of word embedding techniques in bug report management tasks","authors":"Bingting Chen, Weiqin Zou, Biyu Cai, Qianshuang Meng, Wenjie Liu, Piji Li, Lin Chen","doi":"10.1007/s10664-024-10510-3","DOIUrl":"https://doi.org/10.1007/s10664-024-10510-3","url":null,"abstract":"<h3 data-test=\"abstract-sub-heading\">Context</h3><p>Representing the textual semantics of bug reports is a key component of bug report management (BRM) techniques. Existing studies mainly use classical information retrieval-based (IR-based) approaches, such as the vector space model (VSM) to do semantic extraction. Little attention is paid to exploring whether word embedding (WE) models from the natural language process could help BRM tasks.</p><h3 data-test=\"abstract-sub-heading\">Objective</h3><p>To have a general view of the potential of word embedding models in representing the semantics of bug reports and attempt to provide some actionable guidelines in using semantic retrieval models for BRM tasks.</p><h3 data-test=\"abstract-sub-heading\">Method</h3><p>We studied the efficacy of five widely recognized WE models for six BRM tasks on 20 widely-used products from the Eclipse and Mozilla foundations. Specifically, we first explored the suitable machine learning techniques under the use of WE models and the suitable WE model for BRM tasks. Then we studied whether WE models performed better than classical VSM. Last, we investigated whether WE models fine-tuned with bug reports outperformed general pre-trained WE models.</p><h3 data-test=\"abstract-sub-heading\">Key Results</h3><p>The Random Forest (RF) classifier outperformed other typical classifiers under the use of different WE models in semantic extraction.We rarely observed statistically significant performance differences among five WE models in five BRM classification tasks, but we found that small-dimensional WE models performed better than larger ones in the duplicate bug report detection task. Among three BRM tasks (i.e., bug severity prediction, reopened bug prediction, and duplicate bug report detection) that showed statistically significant performance differences, VSM outperformed the studied WE models. We did not find performance improvement after we fine-tuned general pre-trained BERT with bug report data.</p><h3 data-test=\"abstract-sub-heading\">Conclusion</h3><p>Performance improvements of using pre-trained WE models were not observed in studied BRM tasks. The combination of RF and traditional VSM was found to achieve the best performance in various BRM tasks.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"55 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141776764","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}