Scalable inference in max-margin topic models

Jun Zhu, Xun Zheng, Li Zhou, Bo Zhang
{"title":"最大边际主题模型中的可伸缩推理","authors":"Jun Zhu, Xun Zheng, Li Zhou, Bo Zhang","doi":"10.1145/2487575.2487658","DOIUrl":null,"url":null,"abstract":"Topic models have played a pivotal role in analyzing large collections of complex data. Besides discovering latent semantics, supervised topic models (STMs) can make predictions on unseen test data. By marrying with advanced learning techniques, the predictive strengths of STMs have been dramatically enhanced, such as max-margin supervised topic models, state-of-the-art methods that integrate max-margin learning with topic models. Though powerful, max-margin STMs have a hard non-smooth learning problem. Existing algorithms rely on solving multiple latent SVM subproblems in an EM-type procedure, which can be too slow to be applicable to large-scale categorization tasks. In this paper, we present a highly scalable approach to building max-margin supervised topic models. Our approach builds on three key innovations: 1) a new formulation of Gibbs max-margin supervised topic models for both multi-class and multi-label classification; 2) a simple ``augment-and-collapse\" Gibbs sampling algorithm without making restricting assumptions on the posterior distributions; 3) an efficient parallel implementation that can easily tackle data sets with hundreds of categories and millions of documents. Furthermore, our algorithm does not need to solve SVM subproblems. Though performing the two tasks of topic discovery and learning predictive models jointly, which significantly improves the classification performance, our methods have comparable scalability as the state-of-the-art parallel algorithms for the standard LDA topic models which perform the single task of topic discovery only. Finally, an open-source implementation is also provided at: http://www.ml-thu.net/~jun/medlda.","PeriodicalId":20472,"journal":{"name":"Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2013-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"19","resultStr":"{\"title\":\"Scalable inference in max-margin topic models\",\"authors\":\"Jun Zhu, Xun Zheng, Li Zhou, Bo Zhang\",\"doi\":\"10.1145/2487575.2487658\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Topic models have played a pivotal role in analyzing large collections of complex data. Besides discovering latent semantics, supervised topic models (STMs) can make predictions on unseen test data. By marrying with advanced learning techniques, the predictive strengths of STMs have been dramatically enhanced, such as max-margin supervised topic models, state-of-the-art methods that integrate max-margin learning with topic models. Though powerful, max-margin STMs have a hard non-smooth learning problem. Existing algorithms rely on solving multiple latent SVM subproblems in an EM-type procedure, which can be too slow to be applicable to large-scale categorization tasks. In this paper, we present a highly scalable approach to building max-margin supervised topic models. 
Our approach builds on three key innovations: 1) a new formulation of Gibbs max-margin supervised topic models for both multi-class and multi-label classification; 2) a simple ``augment-and-collapse\\\" Gibbs sampling algorithm without making restricting assumptions on the posterior distributions; 3) an efficient parallel implementation that can easily tackle data sets with hundreds of categories and millions of documents. Furthermore, our algorithm does not need to solve SVM subproblems. Though performing the two tasks of topic discovery and learning predictive models jointly, which significantly improves the classification performance, our methods have comparable scalability as the state-of-the-art parallel algorithms for the standard LDA topic models which perform the single task of topic discovery only. Finally, an open-source implementation is also provided at: http://www.ml-thu.net/~jun/medlda.\",\"PeriodicalId\":20472,\"journal\":{\"name\":\"Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2013-08-11\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"19\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2487575.2487658\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2487575.2487658","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 19

Abstract

Topic models have played a pivotal role in analyzing large collections of complex data. Besides discovering latent semantics, supervised topic models (STMs) can make predictions on unseen test data. By marrying advanced learning techniques with topic models, the predictive strength of STMs has been dramatically enhanced; max-margin supervised topic models, which integrate max-margin learning with topic models, are state-of-the-art examples. Though powerful, max-margin STMs pose a hard, non-smooth learning problem. Existing algorithms rely on solving multiple latent SVM subproblems in an EM-type procedure, which can be too slow to apply to large-scale categorization tasks. In this paper, we present a highly scalable approach to building max-margin supervised topic models. Our approach builds on three key innovations: 1) a new formulation of Gibbs max-margin supervised topic models for both multi-class and multi-label classification; 2) a simple "augment-and-collapse" Gibbs sampling algorithm that makes no restrictive assumptions about the posterior distribution; 3) an efficient parallel implementation that can easily tackle data sets with hundreds of categories and millions of documents. Furthermore, our algorithm does not need to solve SVM subproblems. Although our methods perform the two tasks of topic discovery and learning predictive models jointly, which significantly improves classification performance, they scale comparably to state-of-the-art parallel algorithms for standard LDA topic models, which perform the single task of topic discovery only. Finally, an open-source implementation is provided at: http://www.ml-thu.net/~jun/medlda.
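For intuition about the "augment-and-collapse" idea: in the binary case, the hinge loss can be replaced by an unnormalized pseudo-likelihood exp(-2c·max(0, ζ_d)), which admits a scale-mixture-of-Gaussians representation (Polson and Scott's data augmentation for SVMs) with one augmentation variable λ_d per document. Conditioned on the λ_d's, the classifier weights η have a closed-form Gaussian conditional, so no SVM subproblem is ever solved. The sketch below is our reconstruction of these two conditionals under stated assumptions, not the authors' released code; the variable names (c for the margin cost, ell for the margin, nu2 for the prior variance of η) are ours.

```python
import numpy as np

def gibbs_step_eta_lambda(zbar, y, lam, c=1.0, ell=1.0, nu2=1.0, rng=None):
    """One augment-and-collapse style sweep over the classifier part.

    A minimal sketch for binary labels y in {-1, +1}. `zbar` is the
    (D x K) matrix of per-document mean topic assignments; `lam` holds
    the per-document augmentation variables (initialize to ones).
    """
    rng = np.random.default_rng() if rng is None else rng
    D, K = zbar.shape

    # 1) Sample eta from its Gaussian conditional:
    #    precision = I/nu2 + c^2 * sum_d zbar_d zbar_d^T / lam_d
    prec = np.eye(K) / nu2 + (c**2) * (zbar.T * (1.0 / lam)) @ zbar
    b = c * ((y * (1.0 + c * ell / lam)) @ zbar)  # linear term of the exponent
    cov = np.linalg.inv(prec)
    eta = rng.multivariate_normal(cov @ b, cov)

    # 2) Sample the augmentation variables: 1/lam_d is inverse-Gaussian
    #    with mean 1/(c*|zeta_d|), where zeta_d = ell - y_d * eta^T zbar_d.
    zeta = ell - y * (zbar @ eta)
    mean = 1.0 / np.maximum(c * np.abs(zeta), 1e-8)  # guard against zeta ~ 0
    lam = 1.0 / rng.wald(mean, 1.0)  # NumPy's Wald is the inverse Gaussian
    return eta, lam
```

Both conditionals are standard exponential-family draws, which is what removes the latent-SVM subproblems of the EM-type algorithms mentioned above; the remaining work is the collapsed update over topic assignments.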
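The scalability claim is stated relative to parallel samplers for standard LDA. As background (and to make the comparison concrete), here is the collapsed Gibbs update for one token of plain LDA, the single-task baseline referred to above; it is not the max-margin sampler itself, which additionally folds a per-document term from the classifier into this update.

```python
import numpy as np

def resample_token(d, w, k_old, n_dk, n_kw, n_k, alpha, beta, rng):
    """Collapsed Gibbs update for one token of *standard* LDA.

    Count matrices: n_dk (doc-topic), n_kw (topic-word), n_k (topic
    totals); V is the vocabulary size, alpha and beta the Dirichlet
    hyperparameters.
    """
    V = n_kw.shape[1]

    # Remove the token's current assignment from the counts.
    n_dk[d, k_old] -= 1
    n_kw[k_old, w] -= 1
    n_k[k_old] -= 1

    # p(z = k | rest) ∝ (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta)
    p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
    k_new = rng.choice(len(p), p=p / p.sum())

    # Add the token back under its new assignment.
    n_dk[d, k_new] += 1
    n_kw[k_new, w] += 1
    n_k[k_new] += 1
    return k_new
```

Parallel LDA samplers typically shard documents across workers and periodically synchronize the global topic-word counts; the extra conditionals of the max-margin model (for η and the λ_d's) are cheap relative to this token-level sweep, which is consistent with the comparable-scalability claim in the abstract.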