STIF: Semi-Supervised Taxonomy Induction using Term Embeddings and Clustering

Proceedings of the 2021 5th International Conference on Natural Language Processing and Information Retrieval Pub Date : 2021-12-17 DOI:10.1145/3508230.3508247

Maryam Mousavi, Elena Steiner, S. Corman, Scott W. Ruston, Dylan Weber, H. Davulcu

{"title":"STIF: Semi-Supervised Taxonomy Induction using Term Embeddings and Clustering","authors":"Maryam Mousavi, Elena Steiner, S. Corman, Scott W. Ruston, Dylan Weber, H. Davulcu","doi":"10.1145/3508230.3508247","DOIUrl":null,"url":null,"abstract":"In this paper, we developed a semi-supervised taxonomy induction framework using term embedding and clustering methods for a blog corpus comprising 145,000 posts from 650 Ukraine-related blog domains dated between 2010-2020. We extracted 32,429 noun phrases (NPs) and proceeded to split these NPs into a pair of categories: General/Ambiguous phrases, which might appear under any topic vs. Topical/Non-Ambiguous phrases, which pertain to a topic’s specifics. We used term representation and clustering methods to partition the topical/non-ambiguous phrases into 90 groups using the Silhouette method. Next, a team of 10 communications scientists analyzed the NP clusters and inducted a two-level taxonomy alongside its codebook. Upon achieving intercoder reliability of 94%, coders proceeded to map all topical/non-ambiguous phrases into a gold-standard taxonomy. We evaluated a range of term representation and clustering methods using extrinsic and intrinsic measures. We determined that GloVe embeddings with K-Means achieved the highest performance (i.e. 74% purity) for this real-world dataset.","PeriodicalId":252146,"journal":{"name":"Proceedings of the 2021 5th International Conference on Natural Language Processing and Information Retrieval","volume":"3 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2021 5th International Conference on Natural Language Processing and Information Retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3508230.3508247","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

Abstract

In this paper, we developed a semi-supervised taxonomy induction framework using term embedding and clustering methods for a blog corpus comprising 145,000 posts from 650 Ukraine-related blog domains dated between 2010-2020. We extracted 32,429 noun phrases (NPs) and proceeded to split these NPs into a pair of categories: General/Ambiguous phrases, which might appear under any topic vs. Topical/Non-Ambiguous phrases, which pertain to a topic’s specifics. We used term representation and clustering methods to partition the topical/non-ambiguous phrases into 90 groups using the Silhouette method. Next, a team of 10 communications scientists analyzed the NP clusters and inducted a two-level taxonomy alongside its codebook. Upon achieving intercoder reliability of 94%, coders proceeded to map all topical/non-ambiguous phrases into a gold-standard taxonomy. We evaluated a range of term representation and clustering methods using extrinsic and intrinsic measures. We determined that GloVe embeddings with K-Means achieved the highest performance (i.e. 74% purity) for this real-world dataset.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

STIF:使用词嵌入和聚类的半监督分类归纳

在本文中，我们使用术语嵌入和聚类方法开发了一个半监督分类法归纳框架，该框架包含来自650个乌克兰相关博客域的145,000篇文章，时间为2010-2020年。我们提取了32,429个名词短语(NPs)，并将这些NPs分成两类:一般/模糊短语(可能出现在任何主题下)和局部/非模糊短语(与主题的细节有关)。我们使用术语表示和聚类方法，使用Silhouette方法将主题/非歧义短语划分为90组。接下来，一个由10名通信科学家组成的团队分析了NP集群，并在其密码本旁边引入了一个两级分类法。在实现94%的编码器间可靠性之后，编码器开始将所有主题/非歧义短语映射到金标准分类法中。我们评估了一系列术语表示和聚类方法使用外在和内在的措施。我们确定使用K-Means的GloVe嵌入在这个真实数据集中获得了最高的性能(即74%的纯度)。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

Proceedings of the 2021 5th International Conference on Natural Language Processing and Information Retrieval

自引率

0.00%

发文量