IRJIT: A simple, online, information retrieval approach for just-in-time software defect prediction
Pub Date: 2024-08-02 | DOI: 10.1007/s10664-024-10514-z
Hareem Sahar, Abdul Ali Bangash, Abram Hindle, Denilson Barbosa
Just-in-Time software defect prediction (JIT-SDP) prevents the introduction of defects into software by identifying them at commit check-in time. Current software defect prediction approaches rely on manually crafted features such as change metrics and on machine learning or deep learning models that are expensive to train. These models typically require extensive training, significant computational resources, and time, which makes it hard to update them in real time as new examples arrive and limits their suitability for fast online defect prediction. Furthermore, the reliance on a complex underlying model often makes these approaches less explainable: developers cannot understand the reasons behind a model's predictions. An approach that is not explainable may not be adopted in real-life development environments because developers do not trust its results. To address these limitations, we propose IRJIT, an approach that applies information retrieval to source code and labels new commits as buggy or clean based on their similarity to past buggy or clean commits. IRJIT is online and explainable: it can learn from new data without expensive retraining, and developers can see the documents that support a prediction, providing additional context. Evaluating on 10 open-source datasets in a within-project setting, we show that our approach is up to 112 times faster than state-of-the-art ML and DL approaches, offers explainability at the commit and line level, and achieves performance comparable to the state-of-the-art.
{"title":"IRJIT: A simple, online, information retrieval approach for just-in-time software defect prediction","authors":"Hareem Sahar, Abdul Ali Bangash, Abram Hindle, Denilson Barbosa","doi":"10.1007/s10664-024-10514-z","DOIUrl":"https://doi.org/10.1007/s10664-024-10514-z","url":null,"abstract":"<p>Just-in-Time software defect prediction (JIT-SDP) prevents the introduction of defects into the software by identifying them at commit check-in time. Current software defect prediction approaches rely on manually crafted features such as change metrics and involve expensive to train machine learning or deep learning models. These models typically involve extensive training processes that may require significant computational resources and time. These characteristics can pose challenges when attempting to update the models in real-time as new examples become available, potentially impacting their suitability for fast online defect prediction. Furthermore, the reliance on a complex underlying model makes these approaches often less <i>explainable</i>, which means the developers cannot understand the reasons behind models’ predictions. An approach that is not <i>explainable</i> might not be adopted in real-life development environments because of developers’ lack of trust in its results. To address these limitations, we propose an approach called IRJIT that employs information retrieval on source code and labels new commits as buggy or clean based on their similarity to past buggy or clean commits. IRJIT approach is <i>online</i> and <i>explainable</i> as it can learn from new data without expensive retraining, and developers can see the documents that support a prediction, providing additional context. By evaluating 10 open-source datasets in a within project setting, we show that our approach is up to 112 times faster than the state-of-the-art ML and DL approaches, offers explainability at the commit and line level, and has comparable performance to the state-of-the-art.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"182 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-08-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141881760","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Industrial adoption of machine learning techniques for early identification of invalid bug reports
Pub Date: 2024-07-31 | DOI: 10.1007/s10664-024-10502-3
Muhammad Laiq, Nauman bin Ali, Jürgen Börstler, Emelie Engström
Despite the accuracy of machine learning (ML) techniques in predicting invalid bug reports, as shown in earlier research, and the importance of identifying invalid bug reports early in software maintenance, the adoption of ML techniques for this task in industrial practice has yet to be investigated. In this study, we used a technology transfer model to guide the adoption of an ML technique at a company for the early identification of invalid bug reports. In the process, we also identified necessary conditions for adopting such techniques in practice. We followed a case study research approach with various design and analysis iterations for the technology transfer activities. We collected data from bug repositories as well as through focus groups, a questionnaire, and a presentation and feedback session with an expert. As expected, we found that an ML technique can identify invalid bug reports with acceptable accuracy at an early stage. However, the technique's accuracy drops over time in operational use due to changes in the product, the technologies used, or the development organization. Such changes may require retraining the ML model. During validation, practitioners highlighted the need to understand the ML technique's predictions in order to trust them. We found that a visual explanation of a prediction (using a state-of-the-art ML interpretation framework), together with a descriptive one, increases the trustability of the technique compared to presenting only the validity predictions. We conclude that trustability, integration with the existing toolchain, and maintaining the technique's accuracy over time are critical for increasing the likelihood of adoption.
{"title":"Industrial adoption of machine learning techniques for early identification of invalid bug reports","authors":"Muhammad Laiq, Nauman bin Ali, Jürgen Börstler, Emelie Engström","doi":"10.1007/s10664-024-10502-3","DOIUrl":"https://doi.org/10.1007/s10664-024-10502-3","url":null,"abstract":"<p>Despite the accuracy of machine learning (ML) techniques in predicting invalid bug reports, as shown in earlier research, and the importance of early identification of invalid bug reports in software maintenance, the adoption of ML techniques for this task in industrial practice is yet to be investigated. In this study, we used a technology transfer model to guide the adoption of an ML technique at a company for the early identification of invalid bug reports. In the process, we also identify necessary conditions for adopting such techniques in practice. We followed a case study research approach with various design and analysis iterations for technology transfer activities. We collected data from bug repositories, through focus groups, a questionnaire, and a presentation and feedback session with an expert. As expected, we found that an ML technique can identify invalid bug reports with acceptable accuracy at an early stage. However, the technique’s accuracy drops over time in its operational use due to changes in the product, the used technologies, or the development organization. Such changes may require retraining the ML model. During validation, practitioners highlighted the need to understand the ML technique’s predictions to trust the predictions. We found that a visual (using a state-of-the-art ML interpretation framework) and descriptive explanation of the prediction increases the trustability of the technique compared to just presenting the results of the validity predictions. We conclude that trustability, integration with the existing toolchain, and maintaining the techniques’ accuracy over time are critical for increasing the likelihood of adoption.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"42 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141873422","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dependabot and security pull requests: large empirical study
Pub Date: 2024-07-30 | DOI: 10.1007/s10664-024-10523-y
Hocine Rebatchi, Tégawendé F. Bissyandé, Naouel Moha
<p>Modern software development is a complex engineering process where developer code cohabits with an increasingly larger number of external open-source components. Even though these components facilitate sharing and reusing code along with other benefits related to maintenance and code quality, they are often the seeds of vulnerabilities in the software supply chain leading to attacks with severe consequences. Indeed, one common strategy used to conduct attacks is to exploit or inject other security flaws in new versions of dependency packages. It is thus important to keep dependencies updated in a software development project. Unfortunately, several prior studies have highlighted that, to a large extent, developers struggle to keep track of the dependency package updates, and do not quickly incorporate security patches. Therefore, automated dependency-update bots have been proposed to mitigate the impact and the emergence of vulnerabilities in open-source projects. In our study, we focus on Dependabot, a dependency management bot that has gained popularity on GitHub recently. It allows developers to keep a lookout on project dependencies and reduce the effort of monitoring the safety of the software supply chain. We performed a large empirical study on dependency updates and security pull requests to understand: (1) the degree and reasons of Dependabot’s popularity; (2) the patterns of developers’ practices and techniques to deal with vulnerabilities in dependencies; (3) the management of security pull requests (PRs), the threat lifetime, and the fix delay; and (4) the factors that significantly correlate with the acceptance of security PRs and fast merges. To that end, we collected a dataset of 9,916,318 pull request-related issues made in 1,743,035 projects on GitHub for more than 10 different programming languages. In addition to the comprehensive quantitative analysis, we performed a manual qualitative analysis on a representative sample of the dataset, and we substantiated our findings by sending a survey to developers that use dependency management tools. Our study shows that Dependabot dominates more than 65% of dependency management activity, mainly due to its efficiency, accessibility, adaptivity, and availability of support. We also found that developers handle dependency vulnerabilities differently, but mainly rely on the automation of PRs generation to upgrade vulnerable dependencies. Interestingly, Dependabot’s and developers’ security PRs are highly accepted, and the automation allows to accelerate their management, so that fixes are applied in less than one day. However, the threat of dependency vulnerabilities remains hidden for 512 days on average, and patches are disclosed after 362 days due to the reliance on the manual effort of security experts. Also, project characteristics, the amount of PR changes, as well as developer and dependency features seem to be highly correlated with the acceptance and fast merges of security PR
{"title":"Dependabot and security pull requests: large empirical study","authors":"Hocine Rebatchi, Tégawendé F. Bissyandé, Naouel Moha","doi":"10.1007/s10664-024-10523-y","DOIUrl":"https://doi.org/10.1007/s10664-024-10523-y","url":null,"abstract":"<p>Modern software development is a complex engineering process where developer code cohabits with an increasingly larger number of external open-source components. Even though these components facilitate sharing and reusing code along with other benefits related to maintenance and code quality, they are often the seeds of vulnerabilities in the software supply chain leading to attacks with severe consequences. Indeed, one common strategy used to conduct attacks is to exploit or inject other security flaws in new versions of dependency packages. It is thus important to keep dependencies updated in a software development project. Unfortunately, several prior studies have highlighted that, to a large extent, developers struggle to keep track of the dependency package updates, and do not quickly incorporate security patches. Therefore, automated dependency-update bots have been proposed to mitigate the impact and the emergence of vulnerabilities in open-source projects. In our study, we focus on Dependabot, a dependency management bot that has gained popularity on GitHub recently. It allows developers to keep a lookout on project dependencies and reduce the effort of monitoring the safety of the software supply chain. We performed a large empirical study on dependency updates and security pull requests to understand: (1) the degree and reasons of Dependabot’s popularity; (2) the patterns of developers’ practices and techniques to deal with vulnerabilities in dependencies; (3) the management of security pull requests (PRs), the threat lifetime, and the fix delay; and (4) the factors that significantly correlate with the acceptance of security PRs and fast merges. To that end, we collected a dataset of 9,916,318 pull request-related issues made in 1,743,035 projects on GitHub for more than 10 different programming languages. In addition to the comprehensive quantitative analysis, we performed a manual qualitative analysis on a representative sample of the dataset, and we substantiated our findings by sending a survey to developers that use dependency management tools. Our study shows that Dependabot dominates more than 65% of dependency management activity, mainly due to its efficiency, accessibility, adaptivity, and availability of support. We also found that developers handle dependency vulnerabilities differently, but mainly rely on the automation of PRs generation to upgrade vulnerable dependencies. Interestingly, Dependabot’s and developers’ security PRs are highly accepted, and the automation allows to accelerate their management, so that fixes are applied in less than one day. However, the threat of dependency vulnerabilities remains hidden for 512 days on average, and patches are disclosed after 362 days due to the reliance on the manual effort of security experts. 
Also, project characteristics, the amount of PR changes, as well as developer and dependency features seem to be highly correlated with the acceptance and fast merges of security PR","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"296 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141872783","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
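Mining Dependabot activity at this scale starts with queries like the one sketched below, which counts pull requests authored by Dependabot in a repository via GitHub's search API. The `author:app/dependabot` qualifier and the placeholder repository are assumptions about one plausible way to collect such data, not the study's actual pipeline.

```python
# Count PRs opened by Dependabot in a repository via GitHub's search API.
import requests

def dependabot_prs(repo, token=None):
    """Return the total count of PRs authored by Dependabot in `repo`."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:  # unauthenticated requests are heavily rate-limited
        headers["Authorization"] = f"Bearer {token}"
    query = f"repo:{repo} is:pr author:app/dependabot"
    resp = requests.get(
        "https://api.github.com/search/issues",
        params={"q": query, "per_page": 1},
        headers=headers,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["total_count"]

print(dependabot_prs("octocat/Hello-World"))  # placeholder repository
```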
DDImage: an image reduction based approach for automatically explaining black-box classifiers
Mingyue Jiang, Chengjian Tang, Xiao-Yi Zhang, Yangyang Zhao, Zuohua Ding
Pub Date: 2024-07-30 | DOI: 10.1007/s10664-024-10505-0
Due to the prevalent application of machine learning (ML) techniques and the intrinsic black-box nature of ML models, the need for good explanations, ones that are both sufficient and necessary for locally interpreting a model's prediction, has been well recognized and emphasized. Existing explanation approaches, however, favor either sufficiency or necessity. To fill this gap, we propose an approach for generating local explanations that are both sufficient and necessary. Our approach, DDImage, automatically produces local explanations for ML-based image classifiers in a post-hoc way. The core idea behind DDImage is to discover an appropriate explanation by debugging the given input image via a series of image reductions, with respect to the sufficiency and necessity properties. An evaluation of DDImage using publicly available datasets and popular classification models shows its effectiveness and efficiency. Compared with three state-of-the-art approaches, DDImage demonstrates superior performance in producing small explanations that preserve both sufficiency and necessity, and it also shows promising stability and efficiency. We also identify the impact of segmentation granularity, reveal the performance variance across different target models, and further show that our approach is applicable across different problem domains.
{"title":"DDImage: an image reduction based approach for automatically explaining black-box classifiers","authors":"Mingyue Jiang, Chengjian Tang, Xiao-Yi Zhang, Yangyang Zhao, Zuohua Ding","doi":"10.1007/s10664-024-10505-0","DOIUrl":"https://doi.org/10.1007/s10664-024-10505-0","url":null,"abstract":"<p>Due to the prevalent application of machine learning (ML) techniques and the intrinsic black-box nature of ML models, the need for good explanations that are sufficient and necessary towards locally interpreting a model’s prediction has been well recognized and emphasized. Existing explanation approaches, however, favor either the sufficiency or necessity. To fill this gap, in this paper, we propose an approach for generating local explanations that are both sufficient and necessary. Our approach, DDImage, automatically produces local explanations for ML-based image classifiers in a post-hoc way. The core idea behind DDImage is to discover an appropriate explanation by debugging the given input image via a series of image reductions, with respect to the sufficiency and necessity properties. Evaluation of DDImage using publicly available datasets and popular classification models reveals its effectiveness and efficiency. Compared with three state-of-the-art approaches, DDImage demonstrates a superior performance in producing small-sized explanations preserving both sufficiency and necessity, and it also shows promising stability and efficiency. We also identify the impact of segmentation granularity, reveal the performance variance for different target models, and further show that our approach is applicable across different problem domains.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"214 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141872636","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Adopting automated bug assignment in practice — a longitudinal case study at Ericsson
Pub Date: 2024-07-30 | DOI: 10.1007/s10664-024-10507-y
Markus Borg, Leif Jonsson, Emelie Engström, Béla Bartalos, Attila Szabó
[Context] The continuous inflow of bug reports is a considerable challenge in large development projects. Inspired by contemporary work on mining software repositories, we designed a prototype bug assignment solution based on machine learning in 2011-2016. The prototype evolved into an internal Ericsson product, TRR, in 2017-2018. TRR's first bug assignment without human intervention happened in April 2019. [Objective] Our study evaluates the adoption of TRR within its industrial context at Ericsson, i.e., we provide lessons learned related to the productization of a research prototype within a company. Moreover, we investigate 1) how TRR performs in the field, 2) what value TRR provides to Ericsson, and 3) how TRR has influenced the ways of working. [Method] We conducted a preregistered industrial case study combining interviews with TRR stakeholders, minutes from sprint planning meetings, and bug-tracking data. The data analysis includes thematic analysis, descriptive statistics, and Bayesian causal analysis. [Results] TRR is now an incorporated part of the bug assignment process. Considering the abstraction levels of the telecommunications stack, high-level modules are more positive, while low-level modules experienced some drawbacks. Most importantly, some bug reports directly reach low-level modules without first having passed through fundamental root-cause analysis steps at higher levels. On average, TRR automatically assigns 30% of the incoming bug reports with an accuracy of 75%. Auto-routed TRs are resolved around 21% faster within Ericsson, and TRR has saved highly seasoned engineers many hours of work. Indirect effects of adopting TRR include process improvements, process awareness, increased communication, and higher job satisfaction. [Conclusions] TRR has saved time at Ericsson, but the adoption of automated bug assignment was more intricate than in similar endeavors reported by other companies. We primarily attribute the difference to the very large size of the organization and the complexity of its products. Key facilitators of the successful adoption include a gradual introduction, product champions, and careful stakeholder analysis.
{"title":"Adopting automated bug assignment in practice — a longitudinal case study at Ericsson","authors":"Markus Borg, Leif Jonsson, Emelie Engström, Béla Bartalos, Attila Szabó","doi":"10.1007/s10664-024-10507-y","DOIUrl":"https://doi.org/10.1007/s10664-024-10507-y","url":null,"abstract":"<p>[Context] The continuous inflow of bug reports is a considerable challenge in large development projects. Inspired by contemporary work on mining software repositories, we designed a prototype bug assignment solution based on machine learning in 2011-2016. The prototype evolved into an internal Ericsson product, TRR, in 2017-2018. TRR’s first bug assignment without human intervention happened in April 2019. [Objective] Our study evaluates the adoption of TRR within its industrial context at Ericsson, i.e., we provide lessons learned related to the productization of a research prototype within a company. Moreover, we investigate 1) how TRR performs in the field, 2) what value TRR provides to Ericsson, and 3) how TRR has influenced the ways of working. [Method] We conduct a preregistered industrial case study combining interviews with TRR stakeholders, minutes from sprint planning meetings, and bug-tracking data. The data analysis includes thematic analysis, descriptive statistics, and Bayesian causal analysis. [Results] TRR is now an incorporated part of the bug assignment process. Considering the abstraction levels of the telecommunications stack, high-level modules are more positive while low-level modules experienced some drawbacks. Most importantly, some bug reports directly reach low-level modules without first having passed through fundamental root-cause analysis steps at higher levels. On average, TRR automatically assigns 30% of the incoming bug reports with an accuracy of 75%. Auto-routed TRs are resolved around 21% faster within Ericsson, and TRR has saved highly seasoned engineers many hours of work. Indirect effects of adopting TRR include process improvements, process awareness, increased communication, and higher job satisfaction. [Conclusions] TRR has saved time at Ericsson, but the adoption of automated bug assignment was more intricate compared to similar endeavors reported from other companies. We primarily attribute the difference to the very large size of the organization and the complex products. Key facilitators in the successful adoption include a gradual introduction, product champions, and careful stakeholder analysis.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"74 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141872637","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Testing the past: can we still run tests in past snapshots for Java projects?
Pub Date: 2024-07-30 | DOI: 10.1007/s10664-024-10530-z
Michel Maes-Bermejo, Micael Gallego, Francisco Gortázar, Gregorio Robles, Jesus M. Gonzalez-Barahona
Building past snapshots of a software project has been shown to be of interest to both researchers and practitioners. However, little attention has been devoted specifically to the tests available in those past snapshots, which are fundamental for maintaining old versions still in production. The aim of this study is to determine to what extent the tests of past snapshots can be executed successfully, which would mean those snapshots are still testable. Given a software project, we build all of its past snapshots from source code, including tests, and then run the tests. When tests do not succeed, we record the reasons, allowing us to determine the factors that make tests fail. We applied this method to 86 Java projects. On average, all tests pass in 52.53% of the project snapshots on which the tests can be built. However, on average, 94.14% of tests pass in previous snapshots when we account for the percentage of tests passing in the snapshots used to build those tests. In real software projects, successfully running tests in past snapshots is not something we can take for granted: in a large proportion of the projects we studied, it does not happen frequently. We found that building from source code is the main limitation when running tests on past snapshots. However, we also found some projects whose tests run successfully in a very large fraction of past snapshots, which allows us to identify good practices. We also provide a framework and metrics to quantify the testability of past snapshots (the extent to which we are able to run a snapshot's tests with a successful result) from several points of view, which simplifies new analyses on this matter and could help measure how any project performs in this respect.
{"title":"Testing the past: can we still run tests in past snapshots for Java projects?","authors":"Michel Maes-Bermejo, Micael Gallego, Francisco Gortázar, Gregorio Robles, Jesus M. Gonzalez-Barahona","doi":"10.1007/s10664-024-10530-z","DOIUrl":"https://doi.org/10.1007/s10664-024-10530-z","url":null,"abstract":"<p>Building past snapshots of a software project has shown to be of interest both for researchers and practitioners. However, little attention has been devoted specifically to tests available in those past snapshots, which are fundamental for the maintenance of old versions still in production. The aim of this study is to determine to which extent tests of past snapshots can be executed successfully, which would mean these past snapshots are still testable. Given a software project, we build all its past snapshots from source code, including tests, and then run the tests. When tests do not result in success, we also record the reasons, allowing us to determine factors that make tests fail. We run this method on a total of 86 Java projects. On average, for 52.53% of the project snapshots on which tests can be built, all tests pass. However, on average, 94.14% of tests pass in previous snapshots when we take into account the percentage of tests passing in the snapshots used for building those tests. In real software projects, successfully running tests in past snapshots is not something that we can take for granted: we have found that in a large proportion of the projects we studied this does not happen frequently. We have found that the building from source code is the main limitation when running tests on past snapshots. However, we have found some projects for which tests run successfully in a very large fraction of past snapshots, which allows us to identify good practices. We also provide a framework and metrics to quantify testability (the extent to which we are able to run tests of a snapshot with a success result) of past snapshots from several points of view, which simplifies new analyses on this matter, and could help to measure how any project performs in this respect.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"78 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141872638","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Investigating the online recruitment and selection journey of novice software engineers: Anti-patterns and recommendations
Pub Date: 2024-07-30 | DOI: 10.1007/s10664-024-10498-w
Miguel Setúbal, Tayana Conte, Marcos Kalinowski, Allysson Allex Araújo
The growing software development market has increased the demand for qualified professionals in Software Engineering (SE). To this end, companies must enhance their Recruitment and Selection (R&S) processes to maintain high-quality teams, including opening opportunities for beginners such as trainees and interns. However, given the various judgments and sociotechnical factors involved, this complex R&S process poses a challenge for recent graduates seeking to enter the market. This paper aims to identify a set of anti-patterns and recommendations concerning R&S processes for early-career SE professionals. Following an exploratory and qualitative methodological approach, we conducted six online focus groups with 18 recruiters experienced in R&S in the software industry. After completing our qualitative analysis, we identified 12 anti-patterns and 31 actionable recommendations regarding the hiring process focused on entry-level SE professionals. The identified anti-patterns encompass behavioral and technical dimensions inherent in R&S processes. These findings provide a rich opportunity for reflection in the SE industry and offer valuable guidance for early-career candidates and organizations. From an academic perspective, this work also raises awareness of the intersection of Human Resources and SE, an area with considerable potential to be expanded in the context of cooperative and human aspects of SE.
{"title":"Investigating the online recruitment and selection journey of novice software engineers: Anti-patterns and recommendations","authors":"Miguel Setúbal, Tayana Conte, Marcos Kalinowski, Allysson Allex Araújo","doi":"10.1007/s10664-024-10498-w","DOIUrl":"https://doi.org/10.1007/s10664-024-10498-w","url":null,"abstract":"<p>The growing software development market has increased the demand for qualified professionals in Software Engineering (SE). To this end, companies must enhance their Recruitment and Selection (R&S) processes to maintain high-quality teams, including opening opportunities for beginners, such as trainees and interns. However, given the various judgments and sociotechnical factors involved, this complex process of R&S poses a challenge for recent graduates seeking to enter the market. This paper aims to identify a set of anti-patterns and recommendations for early career SE professionals concerning R&S processes. Under an exploratory and qualitative methodological approach, we conducted six online Focus Groups with 18 recruiters with experience in R&S in the software industry. After completing our qualitative analysis, we identified 12 anti-patterns and 31 actionable recommendations regarding the hiring process focused on entry-level SE professionals. The identified anti-patterns encompass behavioral and technical dimensions innate to R&S processes. These findings provide a rich opportunity for reflection in the SE industry and offer valuable guidance for early-career candidates and organizations. From an academic perspective, this work also raises awareness of the intersection of Human Resources and SE, an area with considerable potential to be expanded in the context of cooperative and human aspects of SE.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"44 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141872639","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
App review driven collaborative bug finding
Pub Date: 2024-07-26 | DOI: 10.1007/s10664-024-10489-x
Xunzhu Tang, Haoye Tian, Pingfan Kong, Saad Ezzini, Kui Liu, Xin Xia, Jacques Klein, Tegawendé F. Bissyandé
Software development teams generally welcome any effort to expose bugs in their code base. In this work, we build on the hypothesis that mobile apps from the same category (e.g., two web browser apps) may be affected by similar bugs during their evolution. It is therefore possible to transfer the experience of one historical app to quickly find bugs in its new counterparts, which has been referred to as collaborative bug finding in the literature. Our novelty is that we guide the bug-finding process by considering that existing bugs have been hinted at in app reviews. Concretely, we design the BugRMSys approach to recommend bug reports for a target app by matching historical bug reports from apps in the same category with user reviews of the target app. We show experimentally that this approach enables us to quickly expose and report dozens of bugs for targeted apps such as Brave (a web browser app). BugRMSys's implementation relies on DistilBERT to produce natural-language text embeddings. Our pipeline considers similarities between bug reports and app reviews to identify relevant bugs. We then use the app review, together with potential reproduction steps in the historical bug report (from a same-category app), to reproduce the bugs. Overall, after applying BugRMSys to six popular apps, we were able to identify, reproduce, and report 20 new bugs: among these, 9 reports have already been triaged, 6 were confirmed, and 4 have been fixed by official development teams.
{"title":"App review driven collaborative bug finding","authors":"Xunzhu Tang, Haoye Tian, Pingfan Kong, Saad Ezzini, Kui Liu, Xin Xia, Jacques Klein, Tegawendé F. Bissyandé","doi":"10.1007/s10664-024-10489-x","DOIUrl":"https://doi.org/10.1007/s10664-024-10489-x","url":null,"abstract":"<p>Software development teams generally welcome any effort to expose bugs in their code base. In this work, we build on the hypothesis that mobile apps from the same category (e.g., two web browser apps) may be affected by similar bugs in their evolution process. It is therefore possible to transfer the experience of one historical app to quickly find bugs in its new counterparts. This has been referred to as collaborative bug finding in the literature. Our novelty is that we guide the bug finding process by considering that existing bugs have been hinted within app reviews. Concretely, we design the <span>BugRMSys</span> approach to recommend bug reports for a target app by matching historical bug reports from apps in the same category with user app reviews of the target app. We experimentally show that this approach enables us to quickly expose and report dozens of bugs for targeted apps such as Brave (web browser app). <span>BugRMSys</span> ’s implementation relies on DistilBERT to produce natural language text embeddings. Our pipeline considers similarities between bug reports and app reviews to identify relevant bugs. We then focus on the app review as well as potential reproduction steps in the historical bug report (from a same-category app) to reproduce the bugs. Overall, after applying <span>BugRMSys</span> to six popular apps, we were able to identify, reproduce and report 20 new bugs: among these, 9 reports have been already triaged, 6 were confirmed, and 4 have been fixed by official development teams.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"245 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-07-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141776680","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Neuron importance-aware coverage analysis for deep neural network testing
Pub Date: 2024-07-25 | DOI: 10.1007/s10664-024-10524-x
Hongjing Guo, Chuanqi Tao, Zhiqiu Huang
Deep Neural Network (DNN) models are widely used in many cutting-edge domains, such as medical diagnostics and autonomous driving, and the need to test them thoroughly has become increasingly urgent. Recent research proposes various structural and non-structural coverage criteria to measure test adequacy. Structural coverage criteria quantify the degree to which the internal elements of DNN models are covered by a test suite. However, they convey little information about individual inputs and exhibit limited correlation with defect detection. Additionally, existing non-structural coverage criteria are unaware of neurons' importance to decision-making. This paper addresses these limitations by proposing novel non-structural coverage criteria. By tracing neurons' cumulative contribution to the final decision over the training set, the paper identifies the important neurons of a DNN model. A novel metric is proposed to quantify the difference in important-neuron behavior between a test input and the training set, providing a measurement at the granularity of individual test inputs. Additionally, two non-structural coverage criteria are introduced that quantify test adequacy by examining differences in important-neuron behavior between the testing and training sets. An empirical evaluation on image datasets demonstrates that the proposed metric outperforms existing non-structural adequacy metrics, with up to a 14.7% accuracy improvement in capturing error-revealing test inputs. Compared with state-of-the-art coverage criteria, the proposed criteria are more sensitive to errors, including natural errors and adversarial examples.
{"title":"Neuron importance-aware coverage analysis for deep neural network testing","authors":"Hongjing Guo, Chuanqi Tao, Zhiqiu Huang","doi":"10.1007/s10664-024-10524-x","DOIUrl":"https://doi.org/10.1007/s10664-024-10524-x","url":null,"abstract":"<p>Deep Neural Network (DNN) models are widely used in many cutting-edge domains, such as medical diagnostics and autonomous driving. However, an urgent need to test DNN models thoroughly has increasingly risen. Recent research proposes various structural and non-structural coverage criteria to measure test adequacy. Structural coverage criteria quantify the degree to which the internal elements of DNN models are covered by a test suite. However, they convey little information about individual inputs and exhibit limited correlation with defect detection. Additionally, existing non-structural coverage criteria are unaware of neurons’ importance to decision-making. This paper addresses these limitations by proposing novel non-structural coverage criteria. By tracing neurons’ cumulative contribution to the final decision on the training set, this paper identifies important neurons of DNN models. A novel metric is proposed to quantify the difference in important neuron behavior between a test input and the training set, which provides a measured way at individual test input granularity. Additionally, two non-structural coverage criteria are introduced that allow for the quantification of test adequacy by examining differences in important neuron behavior between the testing and the training set. The empirical evaluation of image datasets demonstrates that the proposed metric outperforms the existing non-structural adequacy metrics by up to 14.7% accuracy improvement in capturing error-revealing test inputs. Compared with state-of-the-art coverage criteria, the proposed coverage criteria are more sensitive to errors, including natural errors and adversarial examples.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"90 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141776682","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An empirical study on the potential of word embedding techniques in bug report management tasks
Bingting Chen, Weiqin Zou, Biyu Cai, Qianshuang Meng, Wenjie Liu, Piji Li, Lin Chen
Pub Date: 2024-07-25 | DOI: 10.1007/s10664-024-10510-3

Context

Representing the textual semantics of bug reports is a key component of bug report management (BRM) techniques. Existing studies mainly use classical information-retrieval (IR) approaches, such as the vector space model (VSM), for semantic extraction. Little attention has been paid to whether word embedding (WE) models from natural language processing could help BRM tasks.
Objective
To gain a general view of the potential of word embedding models for representing the semantics of bug reports, and to provide actionable guidelines on using semantic retrieval models for BRM tasks.
Method
We studied the efficacy of five widely recognized WE models for six BRM tasks on 20 widely used products from the Eclipse and Mozilla foundations. Specifically, we first explored which machine learning techniques are suitable when using WE models and which WE model suits BRM tasks best. Then we studied whether WE models perform better than the classical VSM. Last, we investigated whether WE models fine-tuned on bug reports outperform general pre-trained WE models.
Key Results
The Random Forest (RF) classifier outperformed other typical classifiers when different WE models were used for semantic extraction. We rarely observed statistically significant performance differences among the five WE models in five BRM classification tasks, but we found that small-dimensional WE models performed better than larger ones in the duplicate bug report detection task. In the three BRM tasks that showed statistically significant performance differences (bug severity prediction, reopened bug prediction, and duplicate bug report detection), VSM outperformed the studied WE models. We did not find a performance improvement after fine-tuning general pre-trained BERT with bug report data.
Conclusion
Performance improvements from using pre-trained WE models were not observed in the studied BRM tasks. The combination of RF and the traditional VSM achieved the best performance across the various BRM tasks.
{"title":"An empirical study on the potential of word embedding techniques in bug report management tasks","authors":"Bingting Chen, Weiqin Zou, Biyu Cai, Qianshuang Meng, Wenjie Liu, Piji Li, Lin Chen","doi":"10.1007/s10664-024-10510-3","DOIUrl":"https://doi.org/10.1007/s10664-024-10510-3","url":null,"abstract":"<h3 data-test=\"abstract-sub-heading\">Context</h3><p>Representing the textual semantics of bug reports is a key component of bug report management (BRM) techniques. Existing studies mainly use classical information retrieval-based (IR-based) approaches, such as the vector space model (VSM) to do semantic extraction. Little attention is paid to exploring whether word embedding (WE) models from the natural language process could help BRM tasks.</p><h3 data-test=\"abstract-sub-heading\">Objective</h3><p>To have a general view of the potential of word embedding models in representing the semantics of bug reports and attempt to provide some actionable guidelines in using semantic retrieval models for BRM tasks.</p><h3 data-test=\"abstract-sub-heading\">Method</h3><p>We studied the efficacy of five widely recognized WE models for six BRM tasks on 20 widely-used products from the Eclipse and Mozilla foundations. Specifically, we first explored the suitable machine learning techniques under the use of WE models and the suitable WE model for BRM tasks. Then we studied whether WE models performed better than classical VSM. Last, we investigated whether WE models fine-tuned with bug reports outperformed general pre-trained WE models.</p><h3 data-test=\"abstract-sub-heading\">Key Results</h3><p>The Random Forest (RF) classifier outperformed other typical classifiers under the use of different WE models in semantic extraction.We rarely observed statistically significant performance differences among five WE models in five BRM classification tasks, but we found that small-dimensional WE models performed better than larger ones in the duplicate bug report detection task. Among three BRM tasks (i.e., bug severity prediction, reopened bug prediction, and duplicate bug report detection) that showed statistically significant performance differences, VSM outperformed the studied WE models. We did not find performance improvement after we fine-tuned general pre-trained BERT with bug report data.</p><h3 data-test=\"abstract-sub-heading\">Conclusion</h3><p>Performance improvements of using pre-trained WE models were not observed in studied BRM tasks. The combination of RF and traditional VSM was found to achieve the best performance in various BRM tasks.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"55 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141776764","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}