Scalable inference in max-margin topic models

Jun Zhu, Xun Zheng, Li Zhou, Bo Zhang
{"title":"Scalable inference in max-margin topic models","authors":"Jun Zhu, Xun Zheng, Li Zhou, Bo Zhang","doi":"10.1145/2487575.2487658","DOIUrl":null,"url":null,"abstract":"Topic models have played a pivotal role in analyzing large collections of complex data. Besides discovering latent semantics, supervised topic models (STMs) can make predictions on unseen test data. By marrying with advanced learning techniques, the predictive strengths of STMs have been dramatically enhanced, such as max-margin supervised topic models, state-of-the-art methods that integrate max-margin learning with topic models. Though powerful, max-margin STMs have a hard non-smooth learning problem. Existing algorithms rely on solving multiple latent SVM subproblems in an EM-type procedure, which can be too slow to be applicable to large-scale categorization tasks. In this paper, we present a highly scalable approach to building max-margin supervised topic models. Our approach builds on three key innovations: 1) a new formulation of Gibbs max-margin supervised topic models for both multi-class and multi-label classification; 2) a simple ``augment-and-collapse\" Gibbs sampling algorithm without making restricting assumptions on the posterior distributions; 3) an efficient parallel implementation that can easily tackle data sets with hundreds of categories and millions of documents. Furthermore, our algorithm does not need to solve SVM subproblems. Though performing the two tasks of topic discovery and learning predictive models jointly, which significantly improves the classification performance, our methods have comparable scalability as the state-of-the-art parallel algorithms for the standard LDA topic models which perform the single task of topic discovery only. Finally, an open-source implementation is also provided at: http://www.ml-thu.net/~jun/medlda.","PeriodicalId":20472,"journal":{"name":"Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"86 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2013-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"19","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2487575.2487658","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 19

Abstract

Topic models have played a pivotal role in analyzing large collections of complex data. Besides discovering latent semantics, supervised topic models (STMs) can make predictions on unseen test data. By incorporating advanced learning techniques, the predictive strength of STMs has been dramatically enhanced; max-margin supervised topic models, which integrate max-margin learning with topic models, are among the state-of-the-art methods. Though powerful, max-margin STMs pose a hard, non-smooth learning problem. Existing algorithms rely on solving multiple latent SVM subproblems within an EM-type procedure, which can be too slow for large-scale categorization tasks. In this paper, we present a highly scalable approach to building max-margin supervised topic models. Our approach rests on three key innovations: 1) a new formulation of Gibbs max-margin supervised topic models for both multi-class and multi-label classification; 2) a simple "augment-and-collapse" Gibbs sampling algorithm that makes no restrictive assumptions about the posterior distributions; and 3) an efficient parallel implementation that can easily handle data sets with hundreds of categories and millions of documents. Moreover, our algorithm never needs to solve SVM subproblems. Although our methods perform topic discovery and predictive-model learning jointly, which significantly improves classification performance, their scalability is comparable to that of state-of-the-art parallel algorithms for standard LDA topic models, which perform only the single task of topic discovery. Finally, an open-source implementation is provided at http://www.ml-thu.net/~jun/medlda.
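To make the "augment-and-collapse" idea concrete: for a binary Gibbs classifier, the hinge-loss pseudo-likelihood exp(-2c * max(0, zeta_d)), with zeta_d = ell - y_d * eta^T zbar_d, can be written exactly as a scale mixture of Gaussians over a per-document augmentation variable lambda_d. The sketch below is our own simplified Python illustration of the two resulting non-standard sampling steps, not the authors' released implementation; the regularization constant c, margin ell, prior variance v2, and the toy data are illustrative assumptions, and the collapsed resampling of topic assignments is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_lambda(zeta, c=1.0):
    """Augmentation step: 1/lambda | zeta ~ InverseGaussian(1/(c|zeta|), 1),
    i.e., lambda itself follows a generalized inverse Gaussian."""
    return 1.0 / rng.wald(1.0 / (c * abs(zeta) + 1e-12), 1.0)

def sample_eta(zbar, y, lam, c=1.0, ell=1.0, v2=1.0):
    """Collapse step for the classifier: given lambda, the conditional of the
    weight vector eta under a N(0, v2*I) prior is Gaussian in closed form,
    so no SVM subproblem is ever solved.

    zbar: (D, K) per-document mean topic proportions; y: (D,) labels in {-1, +1}.
    """
    D, K = zbar.shape
    prec = np.eye(K) / v2                      # posterior precision, prior part
    lin = np.zeros(K)                          # linear coefficient of the Gaussian
    for d in range(D):
        prec += (c ** 2 / lam[d]) * np.outer(zbar[d], zbar[d])
        lin += c * y[d] * (1.0 + c * ell / lam[d]) * zbar[d]
    cov = np.linalg.inv(prec)
    return rng.multivariate_normal(cov @ lin, cov)

# Toy run: 4 documents over 3 topics, topic proportions held fixed.
zbar = rng.dirichlet(np.ones(3), size=4)
y = np.array([1, -1, 1, -1])
eta = np.zeros(3)
for _ in range(100):
    zeta = 1.0 - y * (zbar @ eta)              # per-document margins (ell = 1)
    lam = np.array([sample_lambda(z) for z in zeta])
    eta = sample_eta(zbar, y, lam)
print("sampled eta:", eta)
```

Because every conditional is a standard distribution (inverse Gaussian, Gaussian, and multinomial for the topic assignments), the per-document draws are cheap and embarrassingly parallel across documents, which is what lets the sampler scale like parallel LDA.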