{"title":"Understanding Sparse Topical Structure of Short Text via Stochastic Variational-Gibbs Inference","authors":"Tianyi Lin, Siyuan Zhang, Hong Cheng","doi":"10.1145/2983323.2983765","DOIUrl":null,"url":null,"abstract":"With the soaring popularity of online social media like Twitter, analyzing short text has emerged as an increasingly important task which is challenging to classical topic models, as topic sparsity exists in short text. Topic sparsity refers to the observation that individual document usually concentrates on several salient topics, which may be rare in entire corpus. Understanding this sparse topical structure of short text has been recognized as the key ingredient for mining user-generated Web content and social medium, which are featured in the form of extremely short posts and discussions. However, the existing sparsity-enhanced topic models all assume over-complicated generative process, which severely limits their scalability and makes them unable to automatically infer the number of topics from data. In this paper, we propose a probabilistic Bayesian topic model, namely Sparse Dirichlet mixture Topic Model (SparseDTM), based on Indian Buffet Process (IBP) prior, and infer our model on the large text corpora through a novel inference procedure called stochastic variational-Gibbs inference. Unlike prior work, the proposed approach is able to achieve exact sparse topical structure of large short text collections, and automatically identify the number of topics with a good balance between completeness and homogeneity of topic coherence. Experiments on different genres of large text corpora demonstrate that our approach outperforms various existing sparse topic models. The improvement is significant on large-scale collections of short text.","PeriodicalId":250808,"journal":{"name":"Proceedings of the 25th ACM International on Conference on Information and Knowledge Management","volume":"61 9-10","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 25th ACM International on Conference on Information and Knowledge Management","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2983323.2983765","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 5
Abstract
With the soaring popularity of online social media such as Twitter, analyzing short text has become an increasingly important task that challenges classical topic models, because topic sparsity exists in short text. Topic sparsity refers to the observation that an individual document usually concentrates on a few salient topics, which may be rare in the entire corpus. Understanding this sparse topical structure of short text has been recognized as a key ingredient for mining user-generated Web content and social media, which typically take the form of extremely short posts and discussions. However, existing sparsity-enhanced topic models all assume an overly complicated generative process, which severely limits their scalability and prevents them from automatically inferring the number of topics from data. In this paper, we propose a probabilistic Bayesian topic model, namely the Sparse Dirichlet mixture Topic Model (SparseDTM), based on an Indian Buffet Process (IBP) prior, and perform inference on large text corpora through a novel procedure called stochastic variational-Gibbs inference. Unlike prior work, the proposed approach achieves an exactly sparse topical structure for large collections of short text, and automatically identifies the number of topics with a good balance between the completeness and homogeneity of topic coherence. Experiments on large text corpora of different genres demonstrate that our approach outperforms various existing sparse topic models. The improvement is significant on large-scale collections of short text.
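To give a concrete sense of what a stochastic variational-Gibbs hybrid looks like, the following is a minimal Python sketch for a plain LDA-like topic model, not the paper's SparseDTM with an IBP prior. All names and hyperparameters here (K, alpha, eta, tau, kappa, gibbs_sample_doc, minibatch_update, and the toy dimensions) are illustrative assumptions; the sketch only shows the general update pattern: Gibbs sampling of local topic assignments on a minibatch conditioned on the current global variational parameters, followed by a stochastic natural-gradient step on those global parameters.

```python
# Illustrative sketch only: a stochastic variational-Gibbs style loop for an
# LDA-like model, NOT the paper's SparseDTM / IBP construction.
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(0)

V, K, D = 1000, 20, 5000          # vocab size, number of topics, corpus size (toy values)
alpha, eta = 0.1, 0.01            # symmetric Dirichlet hyperparameters (assumed)
lam = rng.gamma(100.0, 0.01, size=(K, V))   # global variational params of topic-word Dirichlets


def expected_log_beta(lam):
    """E[log beta_kv] under q(beta_k) = Dirichlet(lam_k)."""
    return digamma(lam) - digamma(lam.sum(axis=1, keepdims=True))


def gibbs_sample_doc(word_ids, elog_beta, n_sweeps=5):
    """Collapsed Gibbs sweeps over the topic assignments of one document,
    conditioning on the current expected log topic-word parameters."""
    z = rng.integers(K, size=len(word_ids))
    counts = np.bincount(z, minlength=K).astype(float)
    for _ in range(n_sweeps):
        for i, w in enumerate(word_ids):
            counts[z[i]] -= 1
            logp = np.log(counts + alpha) + elog_beta[:, w]
            p = np.exp(logp - logp.max())
            p /= p.sum()
            z[i] = rng.choice(K, p=p)
            counts[z[i]] += 1
    return z


def minibatch_update(docs, lam, t, batch_size=32, kappa=0.7, tau=64.0):
    """One stochastic step: Gibbs on a minibatch of documents, then a noisy
    natural-gradient update of the global parameters lam."""
    elog_beta = expected_log_beta(lam)
    idx = rng.choice(len(docs), batch_size, replace=False)
    stat = np.zeros_like(lam)
    for word_ids in (docs[i] for i in idx):
        z = gibbs_sample_doc(word_ids, elog_beta)
        np.add.at(stat, (z, word_ids), 1.0)      # accumulate topic-word counts
    rho = (tau + t) ** (-kappa)                  # Robbins-Monro step size
    lam_hat = eta + (D / batch_size) * stat      # rescaled sufficient statistics
    return (1.0 - rho) * lam + rho * lam_hat
```

A full training loop would simply call `minibatch_update` repeatedly over random minibatches until the global parameters stabilize; the sparsity-inducing IBP machinery and the automatic selection of the number of topics described in the paper are deliberately omitted here.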