Towards automatic labeling of exception handling bugs: A case study of 10 years bug-fixing in Apache Hadoop

Empirical Software Engineering · Impact Factor 3.5 · Q1 (Computer Science, Software Engineering) · CAS Tier 2 · Published 2024-06-05 · DOI: 10.1007/s10664-024-10494-0
Antônio José A. da Silva, Renan G. Vieira, Diego P. P. Mesquita, João Paulo P. Gomes, Lincoln S. Rocha
Citations: 0

Abstract

Context

Exception handling (EH) bugs stem from incorrect usage of exception handling mechanisms (EHMs) and often incur severe consequences (e.g., system downtime, data loss, and security risks). Tracking EH bugs is particularly relevant for contemporary systems (e.g., cloud- and AI-based systems), in which the software's sophisticated logic poses an additional threat to the correct use of the EHM. On top of that, bug reporters can seldom tag EH bugs, since doing so may require encompassing knowledge of the software's EH strategy. Surprisingly, to the best of our knowledge, there is no automated procedure to identify EH bugs from report descriptions.

Objective

First, we aim to evaluate the extent to which Natural Language Processing (NLP) and Machine Learning (ML) can be used to reliably label EH bugs using the text fields from bug reports (e.g., summary, description, and comments). Second, we aim to provide a reliably labeled dataset that the community can use in future endeavors. Overall, we expect our work to raise the community’s awareness regarding the importance of EH bugs.

Method

We manually analyzed 4,516 bug reports from the four main components of Apache's Hadoop project, of which we labeled approximately 20% (943) as EH bugs. We also labeled 2,584 non-EH bugs by analyzing their bug-fixing code, creating a dataset composed of 7,100 bug reports. Then, we used text vectorization techniques (Bag-of-Words and TF-IDF) to summarize the textual fields of bug reports. Subsequently, we used these vectors to fit five classes of ML methods and evaluated them on unseen data. We also evaluated a pre-trained transformer-based model using the complete textual fields, and we assessed whether considering only EH keywords is enough to achieve high predictive performance.
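The vectorize-then-classify pipeline described above can be sketched as follows. The toy report texts and labels are illustrative stand-ins for the actual dataset, and LogisticRegression stands in for the five families of ML methods the study evaluates; fitting and scoring on the same tiny sample is only to keep the sketch short (the study evaluates on unseen data).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Toy bug-report texts (e.g., summary + description concatenated); 1 = EH bug.
# These examples are invented for illustration, not drawn from the dataset.
reports = [
    "NullPointerException swallowed in catch block during shutdown",
    "UI button label misaligned on small screens",
    "IOException rethrown without closing the stream in finally",
    "Add dark mode support to the settings page",
    "Exception handler logs the error but retries forever",
    "Typo in README installation section",
]
labels = [1, 0, 1, 0, 1, 0]

# Summarize the textual fields with TF-IDF (Bag-of-Words works the same way
# via CountVectorizer), then fit one classifier family on the vectors.
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X = vectorizer.fit_transform(reports)
clf = LogisticRegression().fit(X, labels)

# Score with ROC-AUC, the primary metric reported in the study.
auc = roc_auc_score(labels, clf.predict_proba(X)[:, 1])
```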

Results

Our results show that a pre-trained DistilBERT with a linear layer, trained on our proposed dataset, can reasonably label EH bugs, achieving ROC-AUC scores of up to 0.88. The combination of traditional NLP and ML techniques achieved ROC-AUC scores of up to 0.74 and recall of up to 0.56. As a sanity check, we also evaluated methods using embeddings extracted solely from keywords. Taking ROC-AUC as the primary concern, for the majority of ML methods tested, the analysis suggests that keywords alone are not sufficient to characterize reports of EH bugs, although this conclusion can change under other metrics (such as recall and precision) or ML methods (e.g., Random Forest).
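The keyword-only sanity check can be imitated by replacing the full text vocabulary with binary indicators for a handful of EH terms. The keyword list and toy data below are illustrative assumptions, not the paper's actual keyword set or reports:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Hypothetical EH keyword list; the study's actual list may differ.
EH_KEYWORDS = ["exception", "throw", "catch", "try", "finally", "error"]

def keyword_features(text):
    """One binary indicator per EH keyword; the rest of the report is ignored."""
    lowered = text.lower()
    return [int(kw in lowered) for kw in EH_KEYWORDS]

# Invented toy reports (1 = EH bug), for illustration only.
reports = [
    "Uncaught exception crashes the NameNode on restart",
    "Improve documentation for the configuration options",
    "catch block silently ignores InterruptedException",
    "Slow rendering of the metrics dashboard",
]
labels = [1, 0, 1, 0]

X = [keyword_features(r) for r in reports]
clf = LogisticRegression().fit(X, labels)
auc = roc_auc_score(labels, clf.predict_proba(X)[:, 1])
```

On a realistic corpus, comparing this AUC against the full-text pipeline's AUC is what reveals whether keywords alone suffice.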

Conclusions

To the best of our knowledge, this is the first study addressing the problem of automatically labeling EH bugs. Based on our results, we conclude that the use of ML techniques, especially transformer-based models, is promising for automating the task of labeling EH bugs. Overall, we hope (i) that our work will contribute to raising awareness of EH bugs; and (ii) that our publicly available dataset will serve as a benchmark, paving the way for follow-up work. Additionally, our findings can be used to build tools that help maintainers identify EH bugs during the triage process.


Source Journal

Empirical Software Engineering (Engineering & Technology: Computer Science, Software Engineering)

CiteScore: 8.50
Self-citation rate: 12.20%
Articles per year: 169
Review time: >12 weeks

Journal overview: Empirical Software Engineering provides a forum for applied software engineering research with a strong empirical component, and a venue for publishing empirical results relevant to both researchers and practitioners. Empirical studies presented here usually involve the collection and analysis of data and experience that can be used to characterize, evaluate and reveal relationships between software development deliverables, practices, and technologies. Over time, it is expected that such empirical results will form a body of knowledge leading to widely accepted and well-formed theories. The journal also offers industrial experience reports detailing the application of software technologies - processes, methods, or tools - and their effectiveness in industrial settings. Empirical Software Engineering promotes the publication of industry-relevant research, to address the significant gap between research and practice.