Stefan Hirschmeier, J. Melsbach, D. Schoder, Sven Stahlmann
{"title":"使用词嵌入的大型分类法无监督多标签文档分类","authors":"Stefan Hirschmeier, J. Melsbach, D. Schoder, Sven Stahlmann","doi":"10.1109/CSCI49370.2019.00241","DOIUrl":null,"url":null,"abstract":"More and more businesses are in need for metadata for their documents. However, automatic generation for metadata is not easy, as for supervised document classification, a significant amount of labelled training data is needed, which is not always present in the desired amount or quality. Often, documents need to be tagged with a predefined set of company specific keywords that are organized in a taxonomy. We present an unsupervised approach to perform multi-label document classification for large taxonomies using word embeddings and evaluate it with a dataset of a public broadcaster. We point out strengths of the approach compared to supervised classification and statistical approaches like tf-idf.","PeriodicalId":103662,"journal":{"name":"2019 International Conference on Computational Science and Computational Intelligence (CSCI)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Unsupervised Multi-Label Document Classification for Large Taxonomies Using Word Embeddings\",\"authors\":\"Stefan Hirschmeier, J. Melsbach, D. Schoder, Sven Stahlmann\",\"doi\":\"10.1109/CSCI49370.2019.00241\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"More and more businesses are in need for metadata for their documents. However, automatic generation for metadata is not easy, as for supervised document classification, a significant amount of labelled training data is needed, which is not always present in the desired amount or quality. Often, documents need to be tagged with a predefined set of company specific keywords that are organized in a taxonomy. We present an unsupervised approach to perform multi-label document classification for large taxonomies using word embeddings and evaluate it with a dataset of a public broadcaster. We point out strengths of the approach compared to supervised classification and statistical approaches like tf-idf.\",\"PeriodicalId\":103662,\"journal\":{\"name\":\"2019 International Conference on Computational Science and Computational Intelligence (CSCI)\",\"volume\":\"26 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-12-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2019 International Conference on Computational Science and Computational Intelligence (CSCI)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/CSCI49370.2019.00241\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 International Conference on Computational Science and Computational Intelligence (CSCI)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CSCI49370.2019.00241","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Unsupervised Multi-Label Document Classification for Large Taxonomies Using Word Embeddings
More and more businesses are in need for metadata for their documents. However, automatic generation for metadata is not easy, as for supervised document classification, a significant amount of labelled training data is needed, which is not always present in the desired amount or quality. Often, documents need to be tagged with a predefined set of company specific keywords that are organized in a taxonomy. We present an unsupervised approach to perform multi-label document classification for large taxonomies using word embeddings and evaluate it with a dataset of a public broadcaster. We point out strengths of the approach compared to supervised classification and statistical approaches like tf-idf.