catatis:一个使用变压器对问题报告进行分类的智能工具

2022 IEEE/ACM 1st International Workshop on Natural Language-Based Software Engineering (NLBSE) Pub Date : 2022-03-31 DOI:10.1145/3528588.3528662

M. Izadi

{"title":"catatis:一个使用变压器对问题报告进行分类的智能工具","authors":"M. Izadi","doi":"10.1145/3528588.3528662","DOIUrl":null,"url":null,"abstract":"Users use Issue Tracking Systems to keep track and manage issue reports in their repositories. An issue is a rich source of software information that contains different reports including a problem, a request for new features, or merely a question about the software product. As the number of these issues increases, it becomes harder to manage them manually. Thus, automatic approaches are proposed to help facilitate the management of issue reports. This paper describes CatIss, an automatic Categorizer of Issue reports which is built upon the Transformer-based pre-trained RoBERTa model. CatIss classifies issue reports into three main categories of Bug report, Enhancement/feature request, and Question. First, the datasets provided for the NLBSE tool competition are cleaned and preprocessed. Then, the pre-trained RoBERTa model is fine-tuned on the preprocessed dataset. Evaluating CatIss on about 80 thousand issue reports from GitHub, indicates that it performs very well surpassing the competition baseline, TicketTagger, and achieving 87.2% F1-score (micro average). Additionally, as CatIss is trained on a wide set of repositories, it is a generic prediction model, hence applicable for any unseen software project or projects with little historical data. Scripts for cleaning the datasets, training CatIss and evaluating the model are publicly available. 1","PeriodicalId":313397,"journal":{"name":"2022 IEEE/ACM 1st International Workshop on Natural Language-Based Software Engineering (NLBSE)","volume":"73 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-03-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"11","resultStr":"{\"title\":\"CatIss: An Intelligent Tool for Categorizing Issues Reports using Transformers\",\"authors\":\"M. Izadi\",\"doi\":\"10.1145/3528588.3528662\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Users use Issue Tracking Systems to keep track and manage issue reports in their repositories. An issue is a rich source of software information that contains different reports including a problem, a request for new features, or merely a question about the software product. As the number of these issues increases, it becomes harder to manage them manually. Thus, automatic approaches are proposed to help facilitate the management of issue reports. This paper describes CatIss, an automatic Categorizer of Issue reports which is built upon the Transformer-based pre-trained RoBERTa model. CatIss classifies issue reports into three main categories of Bug report, Enhancement/feature request, and Question. First, the datasets provided for the NLBSE tool competition are cleaned and preprocessed. Then, the pre-trained RoBERTa model is fine-tuned on the preprocessed dataset. Evaluating CatIss on about 80 thousand issue reports from GitHub, indicates that it performs very well surpassing the competition baseline, TicketTagger, and achieving 87.2% F1-score (micro average). Additionally, as CatIss is trained on a wide set of repositories, it is a generic prediction model, hence applicable for any unseen software project or projects with little historical data. Scripts for cleaning the datasets, training CatIss and evaluating the model are publicly available. 1\",\"PeriodicalId\":313397,\"journal\":{\"name\":\"2022 IEEE/ACM 1st International Workshop on Natural Language-Based Software Engineering (NLBSE)\",\"volume\":\"73 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-03-31\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"11\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 IEEE/ACM 1st International Workshop on Natural Language-Based Software Engineering (NLBSE)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3528588.3528662\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE/ACM 1st International Workshop on Natural Language-Based Software Engineering (NLBSE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3528588.3528662","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 11

摘要

用户使用问题跟踪系统在其存储库中跟踪和管理问题报告。问题是一个丰富的软件信息源，它包含不同的报告，包括一个问题、对新特性的请求，或者仅仅是一个关于软件产品的问题。随着这些问题数量的增加，手动管理它们变得越来越困难。因此，提出了自动化方法来帮助促进问题报告的管理。本文描述了caatiss，一个问题报告的自动分类器，它建立在基于transformer的预训练RoBERTa模型之上。CatIss将问题报告分为三大类:Bug报告、增强/特性请求和问题。首先，为NLBSE工具竞赛提供的数据集被清理和预处理。然后，在预处理数据集上对预训练的RoBERTa模型进行微调。对来自GitHub的约8万份问题报告进行评估，表明它的表现非常好，超过了竞争基准TicketTagger，达到了87.2%的f1得分(微平均)。此外，由于catatis是在广泛的存储库集上训练的，因此它是一个通用的预测模型，因此适用于任何未见过的软件项目或具有很少历史数据的项目。用于清理数据集、训练CatIss和评估模型的脚本是公开的。1

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

CatIss: An Intelligent Tool for Categorizing Issues Reports using Transformers

Users use Issue Tracking Systems to keep track and manage issue reports in their repositories. An issue is a rich source of software information that contains different reports including a problem, a request for new features, or merely a question about the software product. As the number of these issues increases, it becomes harder to manage them manually. Thus, automatic approaches are proposed to help facilitate the management of issue reports. This paper describes CatIss, an automatic Categorizer of Issue reports which is built upon the Transformer-based pre-trained RoBERTa model. CatIss classifies issue reports into three main categories of Bug report, Enhancement/feature request, and Question. First, the datasets provided for the NLBSE tool competition are cleaned and preprocessed. Then, the pre-trained RoBERTa model is fine-tuned on the preprocessed dataset. Evaluating CatIss on about 80 thousand issue reports from GitHub, indicates that it performs very well surpassing the competition baseline, TicketTagger, and achieving 87.2% F1-score (micro average). Additionally, as CatIss is trained on a wide set of repositories, it is a generic prediction model, hence applicable for any unseen software project or projects with little historical data. Scripts for cleaning the datasets, training CatIss and evaluating the model are publicly available. 1

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2022 IEEE/ACM 1st International Workshop on Natural Language-Based Software Engineering (NLBSE)

自引率

0.00%

发文量

期刊最新文献

GitHub Issue Classification Using BERT-Style Models Story Point Level Classification by Text Level Graph Neural Network Issue Report Classification Using Pre-trained Language Models Identification of Intra-Domain Ambiguity using Transformer-based Machine Learning Predicting Issue Types with seBERT