CluWords: Exploiting Semantic Word Clustering Representation for Enhanced Topic Modeling

Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining. Pub Date: 2019-01-30. DOI: 10.1145/3289600.3291032

Felipe Viegas, Sérgio D. Canuto, Christian Gomes, Washington Cunha, T. Rosa, Sabir Ribas, L. Rocha, Marcos André Gonçalves


Citations: 47

Abstract



In this paper, we advance the state of the art in topic modeling by means of a new document representation based on pre-trained word embeddings for non-probabilistic matrix factorization. Specifically, our strategy, called CluWords, exploits the nearest words of a given pre-trained word embedding to generate meta-words capable of enhancing the document representation in terms of both syntactic and semantic information. The novel contributions of our solution include: (i) the introduction of a novel data representation for topic modeling based on syntactic and semantic relationships derived from distances calculated within a pre-trained word embedding space and (ii) the proposal of a new TF-IDF-based strategy, developed specifically to weight the CluWords. In our extensive experimental evaluation, covering 12 datasets and 8 state-of-the-art baselines, we outperform the baselines (with a few ties) in almost all cases, with gains of more than 50% against the best baselines (achieving up to 80% against some runners-up). Finally, we show that our method is able to improve document representation for the task of automatic text classification.
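The abstract gives no formulas, so the following is only a minimal sketch of the general idea, not the authors' exact formulation. It assumes that "nearest words" means cosine similarity above a cutoff, and it uses a standard smoothed IDF rather than the paper's own TF-IDF variant. The names `cluword_matrix`, `cluword_tfidf`, and `threshold`, as well as the toy vocabulary and embeddings, are hypothetical.

```python
import numpy as np


def cluword_matrix(embeddings: np.ndarray, threshold: float = 0.4) -> np.ndarray:
    """Return a vocab-by-vocab matrix: row c holds the neighbours of word c.

    Assumption: neighbours are words whose cosine similarity to c is at
    least `threshold`; weaker similarities are set to zero.
    """
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T
    sim[sim < threshold] = 0.0
    np.fill_diagonal(sim, 1.0)  # each word always belongs to its own meta-word
    return sim


def cluword_tfidf(term_freq: np.ndarray, sim: np.ndarray) -> np.ndarray:
    """Weight each document over meta-words instead of raw terms.

    Assumption: a plain smoothed IDF stands in for the paper's weighting.
    """
    # TF of meta-word c in document d: sum over terms t of tf[d, t] * sim[c, t]
    clu_tf = term_freq @ sim.T
    n_docs = term_freq.shape[0]
    doc_freq = np.count_nonzero(clu_tf > 0, axis=0)
    idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1.0
    return clu_tf * idf


# Toy data (hypothetical): "car" and "auto" are close, "apple" is unrelated.
vocab = ["car", "auto", "apple"]
embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
term_freq = np.array([
    [1.0, 0.0, 0.0],  # document 0 contains only "car"
    [0.0, 1.0, 0.0],  # document 1 contains only "auto"
    [0.0, 0.0, 1.0],  # document 2 contains only "apple"
])

sim = cluword_matrix(embeddings)
rep = cluword_tfidf(term_freq, sim)
print(np.round(rep, 3))
```

In this toy case, document 0 receives a non-zero weight for the "auto" meta-word even though it only contains "car", which is the kind of semantic enrichment the abstract describes. A matrix such as `rep` could then be passed to a non-negative matrix factorization routine to extract topics.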
