{"title":"N-gram over Context","authors":"N. Kawamae","doi":"10.1145/2872427.2882981","DOIUrl":null,"url":null,"abstract":"Our proposal, $N$-gram over Context (NOC), is a nonparametric topic model that aims to help our understanding of a given corpus, and be applied to many text mining applications. Like other topic models, NOC represents each document as a mixture of topics and generates each word from one topic. Unlike these models, NOC focuses on both a topic structure as an internal linguistic structure, and N-gram as an external linguistic structure. To improve the quality of topic specific N-grams, NOC reveals a tree of topics that captures the semantic relationship between topics from a given corpus as context, and forms $N$-gram by offering power-law distributions for word frequencies on this topic tree. To gain both these linguistic structures efficiently, NOC learns them from a given corpus in a unified manner. By accessing this entire tree at the word level in the generative process of each document, NOC enables each document to maintain a thematic coherence and form $N$-grams over context. We develop a parallelizable inference algorithm, D-NOC, to support large data sets. Experiments on review articles/papers/tweet show that NOC is useful as a generative model to discover both the topic structure and the corresponding N-grams, and well complements human experts and domain specific knowledge. D-NOC can process large data sets while preserving full generative model performance, by the help of an open-source distributed machine learning framework.","PeriodicalId":20455,"journal":{"name":"Proceedings of the 25th International Conference on World Wide Web","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"8","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 25th International Conference on World Wide Web","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2872427.2882981","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 8
Abstract
Our proposal, $N$-gram over Context (NOC), is a nonparametric topic model that aims to aid understanding of a given corpus and to support many text mining applications. Like other topic models, NOC represents each document as a mixture of topics and generates each word from a single topic. Unlike these models, NOC captures both a topic structure, as an internal linguistic structure, and $N$-grams, as an external linguistic structure. To improve the quality of topic-specific $N$-grams, NOC infers a tree of topics that captures the semantic relationships between the topics of a given corpus as context, and forms $N$-grams by placing power-law distributions over word frequencies on this topic tree. To obtain both linguistic structures efficiently, NOC learns them from the corpus in a unified manner. By accessing the entire tree at the word level in the generative process for each document, NOC enables each document to maintain thematic coherence and to form $N$-grams over context. We develop a parallelizable inference algorithm, D-NOC, to support large data sets. Experiments on review articles, papers, and tweets show that NOC is useful as a generative model for discovering both the topic structure and the corresponding $N$-grams, and that it complements human experts and domain-specific knowledge well. With the help of an open-source distributed machine learning framework, D-NOC processes large data sets while preserving the full generative model's performance.
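The power-law word-frequency behavior the abstract attributes to the topic tree is characteristic of Pitman-Yor-style processes, a common device in nonparametric topic models. The paper's actual construction over the topic tree is not reproduced here; the following is a minimal illustrative sketch of a single Pitman-Yor predictive draw, showing how discounted reuse of previously seen words yields heavy-tailed, power-law-like frequency counts. All identifiers (`pitman_yor_sample`, the parameter values, the integer stand-in vocabulary) are assumptions for illustration, not the paper's API.

```python
import random
from collections import Counter

def pitman_yor_sample(counts, total, discount, concentration, base_draw):
    """One predictive draw from a Pitman-Yor process (illustrative sketch).

    counts: Counter of previously drawn items; total: sum of its values.
    P(new item)      = (concentration + discount * K) / (concentration + total)
    P(existing item) = (count - discount) / (concentration + total)
    where K is the number of distinct items seen so far.
    """
    k = len(counts)
    if total == 0:
        return base_draw()
    p_new = (concentration + discount * k) / (concentration + total)
    if random.random() < p_new:
        return base_draw()
    # Reuse an existing item; the -discount term dampens the rich-get-richer
    # effect just enough to produce power-law frequency distributions.
    r = random.random() * (total - discount * k)
    acc = 0.0
    for item, c in counts.items():
        acc += c - discount
        if r < acc:
            return item
    return item  # guard against floating-point shortfall

if __name__ == "__main__":
    random.seed(0)
    fresh_ids = iter(range(1_000_000))  # stand-in base distribution over a vocabulary
    counts, total = Counter(), 0
    for _ in range(5000):
        w = pitman_yor_sample(counts, total, discount=0.7, concentration=1.0,
                              base_draw=lambda: next(fresh_ids))
        counts[w] += 1
        total += 1
    print("distinct words:", len(counts))
    print("top-5 counts:", [c for _, c in counts.most_common(5)])
```

Running this sketch produces a few very frequent items and a long tail of rare ones, the kind of distribution the abstract describes placing on word frequencies at each node of the topic tree.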