Mining topics in documents: standing on the shoulders of big data

Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining Pub Date : 2014-08-24 DOI:10.1145/2623330.2623622

Zhiyuan Chen, B. Liu


Citations: 147

Abstract

Topic modeling has been widely used to mine topics from documents. However, a key weakness of topic modeling is that it needs a large amount of data (e.g., thousands of documents) to provide reliable statistics to generate coherent topics. In practice, many document collections do not have that many documents. Given a small number of documents, the classic topic model LDA generates very poor topics. Even with a large volume of data, unsupervised learning of topic models can still produce unsatisfactory results. In recent years, knowledge-based topic models have been proposed, which ask human users to provide some prior domain knowledge to guide the model to produce better topics. Our research takes a radically different approach. We propose to learn as humans do, i.e., retaining the results learned in the past and using them to help future learning. When faced with a new task, we first mine some reliable (prior) knowledge from the past learning/modeling results and then use it to guide the model inference to generate more coherent topics. This approach is possible because of the big data readily available on the Web. The proposed algorithm mines two forms of knowledge: must-link (meaning that two words should be in the same topic) and cannot-link (meaning that two words should not be in the same topic). It also deals with two problems of the automatically mined knowledge, i.e., wrong knowledge and knowledge transitivity. Experimental results using review documents from 100 product domains show that the proposed approach makes dramatic improvements over state-of-the-art baselines.
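To make the two forms of knowledge concrete, the following is a minimal sketch of how must-links and cannot-links might be mined from topics learned in earlier domains. It is not the authors' algorithm: the function name `mine_links`, the thresholds `min_must`, `min_word_freq`, and `max_cannot`, and the simple co-occurrence counting are all assumptions made for illustration, and the sketch does not address wrong knowledge or knowledge transitivity.

```python
from collections import Counter
from itertools import combinations


def mine_links(prior_topics, min_must=2, min_word_freq=2, max_cannot=0):
    """Illustrative (hypothetical) miner for must-links and cannot-links.

    prior_topics: list of topics from past domains, each a list of top words.
    Returns two sets of alphabetically ordered word pairs.
    """
    pair_count = Counter()  # number of prior topics containing both words
    word_count = Counter()  # number of prior topics containing the word
    for topic in prior_topics:
        words = sorted(set(topic))
        word_count.update(words)
        pair_count.update(combinations(words, 2))

    # Must-link: the pair shares a topic in at least `min_must` prior topics.
    must = {pair for pair, c in pair_count.items() if c >= min_must}

    # Cannot-link: both words are frequent across prior topics,
    # yet they (almost) never appear in the same topic.
    frequent = sorted(w for w, c in word_count.items() if c >= min_word_freq)
    cannot = {
        (a, b)
        for a, b in combinations(frequent, 2)
        if pair_count[(a, b)] <= max_cannot
    }
    return must, cannot


# Toy prior topics, invented for this example.
prior_topics = [
    ["price", "cheap", "expensive"],
    ["price", "cheap", "cost"],
    ["battery", "life", "charge"],
    ["battery", "life", "hour"],
]
must, cannot = mine_links(prior_topics)
print(sorted(must))    # [('battery', 'life'), ('cheap', 'price')]
print(sorted(cannot))  # pairs such as ('battery', 'price')
```

In this toy run, "cheap" and "price" become a must-link because they share two prior topics, while "battery" and "price" become a cannot-link because each is frequent but they never share a topic. A model for a new domain could then use such pairs to bias inference, which is the role the abstract assigns to the mined knowledge.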


