Unsupervised Extreme Multi Label Classification of Stack Overflow Posts

2022 IEEE/ACM 1st International Workshop on Natural Language-Based Software Engineering (NLBSE). Pub Date: 2022-05-01. DOI: 10.1145/3528588.3528652

Peter Devine, Kelly Blincoe


Citations: 10

Abstract

Knowing the topics of a software forum post, such as those on StackOverflow, allows for greater analysis and understanding of the large amounts of data that come from these communities. One approach to this problem is to use extreme multi label classification (XMLC) to predict the topic (or “tag”) of a post from a potentially very large candidate label set. While previous work has trained these models on data with explicit text-to-tag information, we assess the classification ability of embedding models that have not been trained on such structured data (and are thus “unsupervised”), in order to gauge their potential applicability to other forums or domains in which tag data is not available. We evaluate 14 unsupervised pre-trained models on 0.1% of all StackOverflow posts against all 61,662 possible StackOverflow tags. We find that an MPNet model trained partially on unlabelled StackExchange data (i.e. without tag data) achieves the highest score overall for this task, with a recall score of 0.161 R@1. These results indicate which models are most appropriate for XMLC of StackOverflow posts when supervised training is not feasible, and offer insight into these models’ applicability in similar but not identical domains, such as software product forums. They also suggest that training embedding models on in-domain title-body or question-answer pairs can create an effective zero-shot topic classifier for situations where no topic data is available.
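The following minimal sketch illustrates the general idea described in the abstract: embed a post and every candidate tag, rank tags by similarity to the post, and score the ranking with recall at k (R@k). It is not the authors' implementation. The `embed` function is a toy bag-of-words stand-in for a pre-trained embedding model such as MPNet. The function names, the cosine-similarity ranking, and the R@k definition used here (the fraction of a post's true tags that appear in the top k predictions) are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in (assumption) for a pre-trained embedding model:
    a sparse bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9+#]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_tags(post, tags):
    """Rank candidate tags by similarity to the post, best first.
    No text-to-tag training data is used, only the embeddings."""
    post_vec = embed(post)
    scored = [(cosine(post_vec, embed(tag)), tag) for tag in tags]
    # Sort by descending similarity; ties are broken alphabetically.
    return [tag for _, tag in sorted(scored, key=lambda p: (-p[0], p[1]))]


def recall_at_k(ranked, true_tags, k):
    """Fraction of the true tags found in the top-k ranked tags."""
    hits = len(set(ranked[:k]) & set(true_tags))
    return hits / len(true_tags)


# Hypothetical candidate tags and post, for illustration only.
candidate_tags = ["python", "javascript", "sql", "git"]
post = "How do I reverse a list in Python?"

ranked = rank_tags(post, candidate_tags)
print(ranked[0])
print(recall_at_k(ranked, ["python"], k=1))
```

In a realistic setting, `embed` would be replaced by a dense sentence-embedding model and the tag embeddings would be computed once and reused across posts, since the candidate set is large (61,662 tags in the study above).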


