Empirical Software Engineering: Latest Articles

Dependabot and security pull requests: large empirical study
IF 4.1 · Zone 2, Computer Science · Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING · Pub Date: 2024-07-30 · DOI: 10.1007/s10664-024-10523-y
Hocine Rebatchi, Tégawendé F. Bissyandé, Naouel Moha
Modern software development is a complex engineering process where developer code cohabits with an increasingly large number of external open-source components. Even though these components facilitate sharing and reusing code along with other benefits related to maintenance and code quality, they are often the seeds of vulnerabilities in the software supply chain leading to attacks with severe consequences. Indeed, one common strategy used to conduct attacks is to exploit or inject other security flaws in new versions of dependency packages. It is thus important to keep dependencies updated in a software development project. Unfortunately, several prior studies have highlighted that, to a large extent, developers struggle to keep track of dependency package updates and do not quickly incorporate security patches. Therefore, automated dependency-update bots have been proposed to mitigate the impact and the emergence of vulnerabilities in open-source projects. In our study, we focus on Dependabot, a dependency management bot that has gained popularity on GitHub recently. It allows developers to keep a lookout on project dependencies and reduces the effort of monitoring the safety of the software supply chain. We performed a large empirical study on dependency updates and security pull requests to understand: (1) the degree of and reasons for Dependabot’s popularity; (2) the patterns of developers’ practices and techniques to deal with vulnerabilities in dependencies; (3) the management of security pull requests (PRs), the threat lifetime, and the fix delay; and (4) the factors that significantly correlate with the acceptance of security PRs and fast merges. To that end, we collected a dataset of 9,916,318 pull request-related issues made in 1,743,035 projects on GitHub for more than 10 different programming languages. In addition to the comprehensive quantitative analysis, we performed a manual qualitative analysis on a representative sample of the dataset, and we substantiated our findings by sending a survey to developers that use dependency management tools. Our study shows that Dependabot dominates more than 65% of dependency management activity, mainly due to its efficiency, accessibility, adaptivity, and availability of support. We also found that developers handle dependency vulnerabilities differently, but mainly rely on automated PR generation to upgrade vulnerable dependencies. Interestingly, Dependabot’s and developers’ security PRs are highly accepted, and automation accelerates their management, so that fixes are applied in less than one day. However, the threat of dependency vulnerabilities remains hidden for 512 days on average, and patches are disclosed after 362 days due to the reliance on the manual effort of security experts. Also, project characteristics, the amount of PR changes, as well as developer and dependency features seem to be highly correlated with the acceptance and fast merges of security PRs.
Citations: 0
DDImage: an image reduction based approach for automatically explaining black-box classifiers
IF 4.1 · Zone 2, Computer Science · Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING · Pub Date: 2024-07-30 · DOI: 10.1007/s10664-024-10505-0
Mingyue Jiang, Chengjian Tang, Xiao-Yi Zhang, Yangyang Zhao, Zuohua Ding

Due to the prevalent application of machine learning (ML) techniques and the intrinsic black-box nature of ML models, the need for good explanations that are sufficient and necessary towards locally interpreting a model’s prediction has been well recognized and emphasized. Existing explanation approaches, however, favor either the sufficiency or necessity. To fill this gap, in this paper, we propose an approach for generating local explanations that are both sufficient and necessary. Our approach, DDImage, automatically produces local explanations for ML-based image classifiers in a post-hoc way. The core idea behind DDImage is to discover an appropriate explanation by debugging the given input image via a series of image reductions, with respect to the sufficiency and necessity properties. Evaluation of DDImage using publicly available datasets and popular classification models reveals its effectiveness and efficiency. Compared with three state-of-the-art approaches, DDImage demonstrates a superior performance in producing small-sized explanations preserving both sufficiency and necessity, and it also shows promising stability and efficiency. We also identify the impact of segmentation granularity, reveal the performance variance for different target models, and further show that our approach is applicable across different problem domains.
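
The abstract does not spell out the reduction procedure, so the following is only a minimal sketch of one plausible reading: a greedy, one-segment-at-a-time reduction that keeps removing image segments while the classifier's label is preserved. The function `reduce_segments`, the integer segment ids, and the toy predicate `toy_keeps_label` are all hypothetical stand-ins and are not taken from the paper.

```python
from typing import Callable, FrozenSet, Iterable

Segments = FrozenSet[int]


def reduce_segments(all_segments: Iterable[int],
                    keeps_label: Callable[[Segments], bool]) -> Segments:
    """Greedily drop segments while the visible set still yields the original label."""
    kept = set(all_segments)
    changed = True
    while changed:
        changed = False
        # Iterate over a sorted copy so that removing from `kept` is safe.
        for seg in sorted(kept):
            candidate = frozenset(kept - {seg})
            if keeps_label(candidate):
                kept.remove(seg)
                changed = True
    return frozenset(kept)


def toy_keeps_label(visible: Segments) -> bool:
    """Hypothetical black-box classifier: the label survives iff segments 2 and 5 are visible."""
    return {2, 5} <= visible


explanation = reduce_segments(range(8), toy_keeps_label)
complement = frozenset(range(8)) - explanation

# Sufficiency: the explanation alone preserves the label.
is_sufficient = toy_keeps_label(explanation)
# Necessity: the image without the explanation loses the label.
is_necessary = not toy_keeps_label(complement)
```

A real image classifier would replace the toy predicate with a call that masks the removed segments and compares the predicted class with the original prediction.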

Citations: 0
Adopting automated bug assignment in practice — a longitudinal case study at Ericsson
IF 4.1 · Zone 2, Computer Science · Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING · Pub Date: 2024-07-30 · DOI: 10.1007/s10664-024-10507-y
Markus Borg, Leif Jonsson, Emelie Engström, Béla Bartalos, Attila Szabó

[Context] The continuous inflow of bug reports is a considerable challenge in large development projects. Inspired by contemporary work on mining software repositories, we designed a prototype bug assignment solution based on machine learning in 2011-2016. The prototype evolved into an internal Ericsson product, TRR, in 2017-2018. TRR’s first bug assignment without human intervention happened in April 2019. [Objective] Our study evaluates the adoption of TRR within its industrial context at Ericsson, i.e., we provide lessons learned related to the productization of a research prototype within a company. Moreover, we investigate 1) how TRR performs in the field, 2) what value TRR provides to Ericsson, and 3) how TRR has influenced the ways of working. [Method] We conduct a preregistered industrial case study combining interviews with TRR stakeholders, minutes from sprint planning meetings, and bug-tracking data. The data analysis includes thematic analysis, descriptive statistics, and Bayesian causal analysis. [Results] TRR is now an incorporated part of the bug assignment process. Considering the abstraction levels of the telecommunications stack, high-level modules are more positive while low-level modules experienced some drawbacks. Most importantly, some bug reports directly reach low-level modules without first having passed through fundamental root-cause analysis steps at higher levels. On average, TRR automatically assigns 30% of the incoming bug reports with an accuracy of 75%. Auto-routed TRs are resolved around 21% faster within Ericsson, and TRR has saved highly seasoned engineers many hours of work. Indirect effects of adopting TRR include process improvements, process awareness, increased communication, and higher job satisfaction. [Conclusions] TRR has saved time at Ericsson, but the adoption of automated bug assignment was more intricate compared to similar endeavors reported from other companies. 
We primarily attribute the difference to the very large size of the organization and the complex products. Key facilitators in the successful adoption include a gradual introduction, product champions, and careful stakeholder analysis.
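
To make the two headline percentages concrete, here is a small worked calculation. It assumes that the 75% accuracy is measured over the auto-assigned reports only, which the abstract does not state explicitly; the variable names are illustrative.

```python
# Figures quoted in the abstract.
auto_assigned_share = 0.30       # share of incoming bug reports routed by TRR
auto_assignment_accuracy = 0.75  # assumed to apply to auto-assigned reports only

# Share of ALL incoming reports that TRR routes correctly / incorrectly.
correctly_routed = auto_assigned_share * auto_assignment_accuracy
incorrectly_routed = auto_assigned_share * (1 - auto_assignment_accuracy)
# Remaining reports still go through manual assignment.
manually_assigned = 1 - auto_assigned_share

print(f"correctly auto-routed:   {correctly_routed:.1%}")
print(f"incorrectly auto-routed: {incorrectly_routed:.1%}")
print(f"manually assigned:       {manually_assigned:.1%}")
```

Under that assumption, 0.30 × 0.75 = 0.225, so a bit under a quarter of all incoming reports would reach the right team without human intervention.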

Citations: 0
Testing the past: can we still run tests in past snapshots for Java projects?
IF 4.1 · Zone 2, Computer Science · Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING · Pub Date: 2024-07-30 · DOI: 10.1007/s10664-024-10530-z
Michel Maes-Bermejo, Micael Gallego, Francisco Gortázar, Gregorio Robles, Jesus M. Gonzalez-Barahona

Building past snapshots of a software project has been shown to be of interest both for researchers and practitioners. However, little attention has been devoted specifically to tests available in those past snapshots, which are fundamental for the maintenance of old versions still in production. The aim of this study is to determine to what extent tests of past snapshots can be executed successfully, which would mean these past snapshots are still testable. Given a software project, we build all its past snapshots from source code, including tests, and then run the tests. When tests do not result in success, we also record the reasons, allowing us to determine factors that make tests fail. We run this method on a total of 86 Java projects. On average, for 52.53% of the project snapshots on which tests can be built, all tests pass. However, on average, 94.14% of tests pass in previous snapshots when we take into account the percentage of tests passing in the snapshots used for building those tests. In real software projects, successfully running tests in past snapshots is not something that we can take for granted: we have found that in a large proportion of the projects we studied this does not happen frequently. We have found that building from source code is the main limitation when running tests on past snapshots. However, we have found some projects for which tests run successfully in a very large fraction of past snapshots, which allows us to identify good practices. We also provide a framework and metrics to quantify testability (the extent to which we are able to run tests of a snapshot with a success result) of past snapshots from several points of view, which simplifies new analyses on this matter, and could help to measure how any project performs in this respect.
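
As an illustration of the kind of metric described above, the sketch below computes, per project, the share of snapshots with buildable tests in which every test passes, and then averages that share across projects. The `Snapshot` record, the function `all_pass_rate`, and the sample data are hypothetical; they are not the paper's framework or its dataset.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class Snapshot:
    tests_built: bool   # did the test code compile for this snapshot?
    tests_run: int      # number of tests executed
    tests_passed: int   # number of tests that passed


def all_pass_rate(snapshots: List[Snapshot]) -> float:
    """Share of snapshots with buildable tests in which all tests pass."""
    buildable = [s for s in snapshots if s.tests_built]
    if not buildable:
        return 0.0
    fully_passing = sum(
        1 for s in buildable
        if s.tests_run > 0 and s.tests_passed == s.tests_run
    )
    return fully_passing / len(buildable)


# Made-up histories for two projects.
project_a = [
    Snapshot(True, 10, 10),
    Snapshot(True, 10, 8),
    Snapshot(False, 0, 0),   # tests could not be built, so it is excluded
    Snapshot(True, 12, 12),
]
project_b = [
    Snapshot(True, 5, 5),
    Snapshot(True, 5, 0),
]

per_project = [all_pass_rate(project_a), all_pass_rate(project_b)]
average_rate = mean(per_project)
print(f"average all-tests-pass rate: {average_rate:.2%}")
```

Averaging per project rather than pooling all snapshots keeps a project with a very long history from dominating the figure; whether the study aggregates this way is an assumption here.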

Citations: 0
Investigating the online recruitment and selection journey of novice software engineers: Anti-patterns and recommendations
IF 4.1 · Zone 2, Computer Science · Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING · Pub Date: 2024-07-30 · DOI: 10.1007/s10664-024-10498-w
Miguel Setúbal, Tayana Conte, Marcos Kalinowski, Allysson Allex Araújo

The growing software development market has increased the demand for qualified professionals in Software Engineering (SE). To this end, companies must enhance their Recruitment and Selection (R&S) processes to maintain high-quality teams, including opening opportunities for beginners, such as trainees and interns. However, given the various judgments and sociotechnical factors involved, this complex process of R&S poses a challenge for recent graduates seeking to enter the market. This paper aims to identify a set of anti-patterns and recommendations for early career SE professionals concerning R&S processes. Under an exploratory and qualitative methodological approach, we conducted six online Focus Groups with 18 recruiters with experience in R&S in the software industry. After completing our qualitative analysis, we identified 12 anti-patterns and 31 actionable recommendations regarding the hiring process focused on entry-level SE professionals. The identified anti-patterns encompass behavioral and technical dimensions innate to R&S processes. These findings provide a rich opportunity for reflection in the SE industry and offer valuable guidance for early-career candidates and organizations. From an academic perspective, this work also raises awareness of the intersection of Human Resources and SE, an area with considerable potential to be expanded in the context of cooperative and human aspects of SE.

Citations: 0
App review driven collaborative bug finding
IF 4.1 · Zone 2, Computer Science · Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING · Pub Date: 2024-07-26 · DOI: 10.1007/s10664-024-10489-x
Xunzhu Tang, Haoye Tian, Pingfan Kong, Saad Ezzini, Kui Liu, Xin Xia, Jacques Klein, Tegawendé F. Bissyandé

Software development teams generally welcome any effort to expose bugs in their code base. In this work, we build on the hypothesis that mobile apps from the same category (e.g., two web browser apps) may be affected by similar bugs in their evolution process. It is therefore possible to transfer the experience of one historical app to quickly find bugs in its new counterparts. This has been referred to as collaborative bug finding in the literature. Our novelty is that we guide the bug finding process by considering that existing bugs have been hinted at within app reviews. Concretely, we design the BugRMSys approach to recommend bug reports for a target app by matching historical bug reports from apps in the same category with user app reviews of the target app. We experimentally show that this approach enables us to quickly expose and report dozens of bugs for targeted apps such as Brave (web browser app). BugRMSys’s implementation relies on DistilBERT to produce natural language text embeddings. Our pipeline considers similarities between bug reports and app reviews to identify relevant bugs. We then focus on the app review as well as potential reproduction steps in the historical bug report (from a same-category app) to reproduce the bugs. Overall, after applying BugRMSys to six popular apps, we were able to identify, reproduce and report 20 new bugs: among these, 9 reports have been already triaged, 6 were confirmed, and 4 have been fixed by official development teams.
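
The matching step described above can be sketched as follows. This is a simplified illustration, not BugRMSys itself: to stay dependency-free it swaps the DistilBERT embeddings for a bag-of-words vector, and the function names, the 0.3 threshold, and the sample texts are all assumptions made for the example.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Stand-in embedding: bag-of-words counts (the paper uses DistilBERT instead)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(bug_reports: List[str], reviews: List[str],
              threshold: float = 0.3) -> List[Tuple[float, str, str]]:
    """Pair historical bug reports with target-app reviews whose similarity meets the threshold."""
    pairs = []
    for report in bug_reports:
        for review in reviews:
            score = cosine(embed(report), embed(review))
            if score >= threshold:
                pairs.append((score, report, review))
    return sorted(pairs, reverse=True)  # most similar pairs first


# Hypothetical inputs: reports from a same-category app, reviews of the target app.
bug_reports = [
    "app crashes when opening a new tab",
    "dark mode setting resets after restart",
]
reviews = [
    "the app crashes every time i open a new tab",
    "love the design and speed",
]

matches = recommend(bug_reports, reviews)
for score, report, review in matches:
    print(f"{score:.2f}  report={report!r}  review={review!r}")
```

With contextual embeddings in place of word counts, paraphrased reviews that share no vocabulary with a bug report could still match, which is presumably why a language model is used in the actual system.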

Citations: 0
Neuron importance-aware coverage analysis for deep neural network testing
IF 4.1; Zone 2 (Computer Science); Q1 (Computer Science, Software Engineering); Pub Date: 2024-07-25; DOI: 10.1007/s10664-024-10524-x
Hongjing Guo, Chuanqi Tao, Zhiqiu Huang

Deep Neural Network (DNN) models are widely used in many cutting-edge domains, such as medical diagnostics and autonomous driving, and the need to test them thoroughly has become increasingly urgent. Recent research proposes various structural and non-structural coverage criteria to measure test adequacy. Structural coverage criteria quantify the degree to which the internal elements of DNN models are covered by a test suite. However, they convey little information about individual inputs and exhibit limited correlation with defect detection. Additionally, existing non-structural coverage criteria do not account for the importance of neurons to decision-making. This paper addresses these limitations by proposing novel non-structural coverage criteria. By tracing neurons' cumulative contribution to the final decision on the training set, this paper identifies the important neurons of DNN models. A novel metric is proposed to quantify the difference in important-neuron behavior between a test input and the training set, which provides a measure at the granularity of individual test inputs. Additionally, two non-structural coverage criteria are introduced that quantify test adequacy by examining differences in important-neuron behavior between the test set and the training set. The empirical evaluation on image datasets demonstrates that the proposed metric outperforms existing non-structural adequacy metrics, with an accuracy improvement of up to 14.7% in capturing error-revealing test inputs. Compared with state-of-the-art coverage criteria, the proposed coverage criteria are more sensitive to errors, including natural errors and adversarial examples.

Citations: 0
An empirical study on the potential of word embedding techniques in bug report management tasks
IF 4.1; Zone 2 (Computer Science); Q1 (Computer Science, Software Engineering); Pub Date: 2024-07-25; DOI: 10.1007/s10664-024-10510-3
Bingting Chen, Weiqin Zou, Biyu Cai, Qianshuang Meng, Wenjie Liu, Piji Li, Lin Chen

Context

Representing the textual semantics of bug reports is a key component of bug report management (BRM) techniques. Existing studies mainly use classical information retrieval-based (IR-based) approaches, such as the vector space model (VSM), for semantic extraction. Little attention has been paid to exploring whether word embedding (WE) models from natural language processing could help BRM tasks.

Objective

To gain a general view of the potential of word embedding models in representing the semantics of bug reports, and to provide actionable guidelines for using semantic retrieval models in BRM tasks.

Method

We studied the efficacy of five widely recognized WE models for six BRM tasks on 20 widely used products from the Eclipse and Mozilla foundations. Specifically, we first explored which machine learning techniques are suitable when WE models are used and which WE model is suitable for BRM tasks. Then we studied whether WE models performed better than the classical VSM. Last, we investigated whether WE models fine-tuned with bug reports outperformed general pre-trained WE models.

Key Results

The Random Forest (RF) classifier outperformed other typical classifiers when different WE models were used for semantic extraction. We rarely observed statistically significant performance differences among the five WE models in five BRM classification tasks, but we found that small-dimensional WE models performed better than larger ones in the duplicate bug report detection task. In the three BRM tasks that showed statistically significant performance differences (i.e., bug severity prediction, reopened bug prediction, and duplicate bug report detection), VSM outperformed the studied WE models. We did not find any performance improvement after fine-tuning a general pre-trained BERT model with bug report data.

Conclusion

No performance improvement from using pre-trained WE models was observed in the studied BRM tasks. The combination of RF and the traditional VSM was found to achieve the best performance in various BRM tasks.
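For readers unfamiliar with the classical baseline, a vector space model can be sketched in a few lines: each bug report becomes a sparse TF-IDF vector, and reports are compared by cosine similarity. The sketch below is illustrative only and is not the study's pipeline. Whitespace tokenization, the smoothed IDF formula, and the toy reports are assumptions, and the Random Forest classification stage is omitted.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        vectors.append({
            # Smoothed IDF keeps every weight strictly positive.
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0

# Toy bug reports (invented for illustration).
reports = [
    "browser crashes when opening a new tab",
    "crash when a new tab is opened in the browser",
    "dark theme ignores the system setting",
]
vecs = tfidf_vectors(reports)
sim_dup = cosine(vecs[0], vecs[1])    # likely duplicates share several terms
sim_other = cosine(vecs[0], vecs[2])  # unrelated reports share none
print(sim_dup > sim_other)
```

For the classification tasks named in the abstract, vectors like these would serve as features for a classifier such as Random Forest. Duplicate detection can rank candidate pairs by the similarity score directly.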

Citations: 0
The role of psychological safety in promoting software quality in agile teams
IF 4.1; Zone 2 (Computer Science); Q1 (Computer Science, Software Engineering); Pub Date: 2024-07-25; DOI: 10.1007/s10664-024-10512-1
Adam Alami, Mansooreh Zahedi, Oliver Krancher

Psychological safety continues to pique the interest of scholars in a variety of disciplines. Recent research indicates that psychological safety fosters knowledge sharing and norm clarity and complements agile values. Although software quality remains a concern in the software industry, academics have yet to investigate whether and how psychologically safe teams deliver superior results. In this study, we explore how psychological safety influences agile teams' quality-related behaviors aimed at enhancing software quality. To widen the empirical coverage and evaluate the results, we chose a two-phase mixed-methods research design with an exploratory qualitative phase (20 interviews) followed by a quantitative phase (survey study, N = 423). Our findings show that, when psychological safety is established in agile software teams, it induces enablers of a social nature that advance the teams' ability to pursue software quality. For example, admitting mistakes and taking initiative both help teams learn and apply what they have learned to future decisions related to software quality. Past mistakes become points of reference for avoiding them in the future. Individuals become more willing to take initiatives aimed at enhancing quality practices and mitigating software quality issues. Our work contributes to the understanding of the circumstances that promote software quality. Psychological safety requires organizations, their management, agile teams, and individuals to maintain and propagate safety principles. Our results also suggest that technological tools and procedures can be utilized alongside social strategies to promote software quality.

Citations: 0
The impact of concept drift and data leakage on log level prediction models
IF 4.1; Zone 2 (Computer Science); Q1 (Computer Science, Software Engineering); Pub Date: 2024-07-25; DOI: 10.1007/s10664-024-10518-9
Youssef Esseddiq Ouatiti, Mohammed Sayagh, Noureddine Kerzazi, Bram Adams, Ahmed E. Hassan

Developers insert logging statements to collect information about the execution of their systems. Using a logging framework (e.g., Log4j), practitioners can decide which log statements to print or suppress by tagging each log line with a log level. Since picking the right log level for a new logging statement is not straightforward, prior studies have proposed machine learning models for log level prediction (LLP). While these models perform well, they remain subject to the context in which they are applied, specifically to the way practitioners decide on log levels in different phases of their projects' development history (e.g., debugging vs. testing). For example, OpenStack developers alternately increased and decreased the verbosity of their logs across the history of the project in response to code changes (e.g., before vs. after fixing a new bug). The manifestation of these changing log verbosity choices over time can thus lead to concept drift and data leakage issues. In this paper, we empirically quantify the impact of data leakage and concept drift on the performance and interpretability of LLP models in three large open-source systems. Additionally, we compare the performance and interpretability of several time-aware approaches to tackle time-related issues. We observe that both shallow and deep-learning-based models suffer from both time-related issues. We also observe that training a model on just a window of the historical data (i.e., contextual model) outperforms models that are trained on the whole historical data (i.e., all-knowing model) in the case of our shallow LLP model. Finally, we observe that contextual models exhibit a different (even contradictory) model interpretability, with a (very) weak correlation between the ranking of important features of the pairs of contextual models we compared. 
Our findings suggest that data leakage and concept drift should be taken into consideration for LLP models. We also invite practitioners to include the size of the historical window as an additional hyperparameter to tune a suitable contextual model instead of leveraging all-knowing models.
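The difference between an all-knowing model and a contextual model comes down to which historical samples are allowed into the training set. The sketch below shows one way to build a leakage-free, optionally windowed split. It is an illustration under assumptions, not the paper's setup: the `time_aware_split` helper, the `(date, features, log_level)` sample layout, and the 365-day window are all hypothetical.

```python
from datetime import date

def time_aware_split(samples, cutoff, window_days=None):
    """Split samples chronologically around a cutoff date.

    samples: list of (commit_date, features, log_level) tuples.
    Training data contains only samples strictly before the cutoff, so no
    future information leaks into training. If window_days is given, only
    the most recent window of history is kept (contextual model); otherwise
    the whole history is used (all-knowing model).
    """
    history = [s for s in samples if s[0] < cutoff]
    future = [s for s in samples if s[0] >= cutoff]
    if window_days is not None:
        history = [s for s in history if (cutoff - s[0]).days <= window_days]
    return history, future

# Toy logging statements (invented for illustration).
samples = [
    (date(2020, 1, 1), {"in_catch_block": True}, "error"),
    (date(2021, 6, 1), {"in_catch_block": False}, "debug"),
    (date(2021, 12, 1), {"in_catch_block": True}, "warn"),
    (date(2022, 2, 1), {"in_catch_block": False}, "info"),
]
cutoff = date(2022, 1, 1)

all_knowing_train, test = time_aware_split(samples, cutoff)
contextual_train, _ = time_aware_split(samples, cutoff, window_days=365)
print(len(all_knowing_train), len(contextual_train), len(test))
```

Treating `window_days` as a tunable value mirrors the abstract's suggestion to include the size of the historical window as an additional hyperparameter.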

Citations: 0