An extensive replication study of the ABLoTS approach for bug localization
Feifei Niu, Enshuo Zhang, Christoph Mayr-Dorn, Wesley Klewerton Guez Assunção, Liguo Huang, Jidong Ge, Bin Luo, Alexander Egyed
Empirical Software Engineering, published online 24 August 2024. DOI: 10.1007/s10664-024-10537-6 (https://doi.org/10.1007/s10664-024-10537-6)
Abstract
Bug localization is the task of recommending source code locations (typically files) that contain the cause of a bug and hence need to be changed to fix it. Along these lines, information retrieval-based bug localization (IRBL) approaches have been adopted, which identify the most bug-prone files in the source code space. In current practice, a series of state-of-the-art IRBL techniques combine different components (e.g., similar reports, version history, and code structure) to achieve better performance. ABLoTS is a recently proposed approach whose core component, TraceScore, utilizes requirements and traceability information between different issue reports (i.e., feature requests and bug reports) to identify buggy source code snippets, with promising results. To evaluate the accuracy of these results and obtain additional insights into the practical applicability of ABLoTS, we conducted a replication study of this approach on the original dataset as well as on two extended datasets (an additional Java dataset and a Python dataset). The original dataset consists of 11 open source Java projects with 8,494 bug reports. The extended Java dataset adds 16 more projects comprising 25,893 bug reports and the corresponding source code commits. The extended Python dataset consists of 12 projects with 1,289 bug reports. While we find that the TraceScore component, the core of ABLoTS, produces comparable or even better results on the extended datasets, we also find that we cannot reproduce the ABLoTS results reported in the original paper, due to an overlooked side effect of an incorrectly chosen cut-off date that let test data leak into training data, with significant effects on performance. Additionally, we conduct experiments to assess the performance of various composers that aggregate the scores of the different components, revealing that Logistic Regression, fixed weight, and CombSUM outperform the other composers across all three datasets, while decision tree and random forest exhibit subpar performance.
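The two technical points in the abstract, the chronological cut-off that keeps test data out of the training data and the composers that aggregate per-component scores (e.g., CombSUM), can be made concrete with a short sketch. The Python below is purely illustrative and not taken from the ABLoTS replication artifact; all names (BugReport, chronological_split, comb_sum) are hypothetical, and the min-max normalization inside the composer is one common choice rather than necessarily the paper's exact setup.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List


@dataclass
class BugReport:
    report_id: str
    created: datetime  # filing date of the bug report
    # Per-component relevance scores for each candidate source file, e.g.
    # {"TraceScore": {"Foo.java": 0.8, ...}, "SimilarReports": {...}}
    component_scores: Dict[str, Dict[str, float]]


def chronological_split(reports: List[BugReport], cutoff: datetime):
    """Split strictly by filing date so that reports created on or after the
    cut-off never end up in the training set (the leakage described above)."""
    train = [r for r in reports if r.created < cutoff]
    test = [r for r in reports if r.created >= cutoff]
    return train, test


def comb_sum(report: BugReport) -> Dict[str, float]:
    """CombSUM-style composer: min-max normalize each component's scores,
    then sum the normalized scores per candidate file."""
    combined: Dict[str, float] = {}
    for scores in report.component_scores.values():
        if not scores:
            continue
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid division by zero when all scores are equal
        for path, score in scores.items():
            combined[path] = combined.get(path, 0.0) + (score - lo) / span
    return combined


# Usage: rank candidate files for one held-out bug report.
# ranked = sorted(comb_sum(test_report).items(), key=lambda kv: kv[1], reverse=True)
```

A fixed-weight composer would differ only in multiplying each component's normalized score by a preset weight before summing, which is why the two behave similarly in the reported results.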
About the journal:
Empirical Software Engineering provides a forum for applied software engineering research with a strong empirical component, and a venue for publishing empirical results relevant to both researchers and practitioners. Empirical studies presented here usually involve the collection and analysis of data and experience that can be used to characterize, evaluate and reveal relationships between software development deliverables, practices, and technologies. Over time, it is expected that such empirical results will form a body of knowledge leading to widely accepted and well-formed theories.
The journal also offers industrial experience reports detailing the application of software technologies - processes, methods, or tools - and their effectiveness in industrial settings.
Empirical Software Engineering promotes the publication of industry-relevant research to address the significant gap between research and practice.