基于dirichlet多项混合模型的短文本聚类方法

Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining Pub Date : 2014-08-24 DOI:10.1145/2623330.2623715

Jianhua Yin, Jianyong Wang

{"title":"基于dirichlet多项混合模型的短文本聚类方法","authors":"Jianhua Yin, Jianyong Wang","doi":"10.1145/2623330.2623715","DOIUrl":null,"url":null,"abstract":"Short text clustering has become an increasingly important task with the popularity of social media like Twitter, Google+, and Facebook. It is a challenging problem due to its sparse, high-dimensional, and large-volume characteristics. In this paper, we proposed a collapsed Gibbs Sampling algorithm for the Dirichlet Multinomial Mixture model for short text clustering (abbr. to GSDMM). We found that GSDMM can infer the number of clusters automatically with a good balance between the completeness and homogeneity of the clustering results, and is fast to converge. GSDMM can also cope with the sparse and high-dimensional problem of short texts, and can obtain the representative words of each cluster. Our extensive experimental study shows that GSDMM can achieve significantly better performance than three other clustering models.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"6 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"443","resultStr":"{\"title\":\"A dirichlet multinomial mixture model-based approach for short text clustering\",\"authors\":\"Jianhua Yin, Jianyong Wang\",\"doi\":\"10.1145/2623330.2623715\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Short text clustering has become an increasingly important task with the popularity of social media like Twitter, Google+, and Facebook. It is a challenging problem due to its sparse, high-dimensional, and large-volume characteristics. In this paper, we proposed a collapsed Gibbs Sampling algorithm for the Dirichlet Multinomial Mixture model for short text clustering (abbr. to GSDMM). We found that GSDMM can infer the number of clusters automatically with a good balance between the completeness and homogeneity of the clustering results, and is fast to converge. GSDMM can also cope with the sparse and high-dimensional problem of short texts, and can obtain the representative words of each cluster. Our extensive experimental study shows that GSDMM can achieve significantly better performance than three other clustering models.\",\"PeriodicalId\":20536,\"journal\":{\"name\":\"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining\",\"volume\":\"6 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2014-08-24\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"443\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2623330.2623715\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2623330.2623715","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 443

摘要

随着Twitter、Google+和Facebook等社交媒体的普及，短文本聚类已经成为一项越来越重要的任务。由于其稀疏、高维、大体积的特点，这是一个具有挑战性的问题。本文针对短文本聚类的Dirichlet多项式混合模型(简称GSDMM)提出了一种坍缩的Gibbs抽样算法。我们发现，GSDMM可以自动推断聚类的数量，在聚类结果的完备性和均匀性之间取得很好的平衡，收敛速度快。GSDMM还可以处理短文本的稀疏和高维问题，并可以获得每个聚类的代表词。我们广泛的实验研究表明，GSDMM可以获得明显优于其他三种聚类模型的性能。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

A dirichlet multinomial mixture model-based approach for short text clustering

Short text clustering has become an increasingly important task with the popularity of social media like Twitter, Google+, and Facebook. It is a challenging problem due to its sparse, high-dimensional, and large-volume characteristics. In this paper, we proposed a collapsed Gibbs Sampling algorithm for the Dirichlet Multinomial Mixture model for short text clustering (abbr. to GSDMM). We found that GSDMM can infer the number of clusters automatically with a good balance between the completeness and homogeneity of the clustering results, and is fast to converge. GSDMM can also cope with the sparse and high-dimensional problem of short texts, and can obtain the representative words of each cluster. Our extensive experimental study shows that GSDMM can achieve significantly better performance than three other clustering models.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining

自引率

0.00%

发文量