Unsupervised Multi-Label Document Classification for Large Taxonomies Using Word Embeddings

2019 International Conference on Computational Science and Computational Intelligence (CSCI). Pub Date: 2019-12-01. DOI: 10.1109/CSCI49370.2019.00241

Stefan Hirschmeier, J. Melsbach, D. Schoder, Sven Stahlmann

Citations: 1

Abstract

More and more businesses need metadata for their documents. However, generating metadata automatically is not easy: supervised document classification requires a significant amount of labelled training data, which is not always available in the desired amount or quality. Often, documents need to be tagged with a predefined set of company-specific keywords that are organized in a taxonomy. We present an unsupervised approach to multi-label document classification for large taxonomies using word embeddings and evaluate it on a dataset from a public broadcaster. We point out strengths of the approach compared to supervised classification and statistical approaches such as tf-idf.
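
The abstract does not spell out the method, so the following is only a minimal sketch of one common way to tag documents with taxonomy keywords using word embeddings and no labelled data; it is not the authors' implementation. It assumes that a document and each taxonomy keyword can be represented by the average of their word vectors, and that a keyword is assigned when the cosine similarity passes a threshold. The toy vectors in EMBEDDINGS, the function names, and the threshold values are all hypothetical; a real system would load pretrained embeddings.

```python
import math

# Toy 3-dimensional word vectors with made-up values, for illustration only.
# A real system would load pretrained embeddings instead.
EMBEDDINGS = {
    "football": [0.9, 0.1, 0.0],
    "goal": [0.8, 0.2, 0.1],
    "match": [0.7, 0.3, 0.0],
    "election": [0.1, 0.9, 0.1],
    "vote": [0.0, 0.8, 0.2],
    "sports": [0.85, 0.15, 0.05],
    "politics": [0.05, 0.9, 0.1],
    "weather": [0.1, 0.1, 0.9],
}


def mean_vector(tokens):
    """Average the vectors of the known tokens; return None if none are known."""
    vectors = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify(text, taxonomy, threshold=0.8):
    """Return (label, score) pairs at or above the threshold, best first.

    Several labels can pass the threshold, which makes this multi-label.
    """
    doc_vec = mean_vector(text.lower().split())
    if doc_vec is None:
        return []
    scored = []
    for label in taxonomy:
        label_vec = mean_vector(label.lower().split())
        if label_vec is None:
            continue  # label has no known words, so it cannot be scored
        score = cosine(doc_vec, label_vec)
        if score >= threshold:
            scored.append((label, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    taxonomy = ["sports", "politics", "weather"]
    print(classify("football match goal", taxonomy))
    print(classify("election vote football", taxonomy, threshold=0.6))
```

Because no classifier is trained, adding a keyword to the taxonomy only requires that its words have vectors, which is one reason this style of approach is attractive for large taxonomies. The threshold trades precision against the number of labels assigned and would need tuning on real data.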
