{"title":"N-gram over Context","authors":"N. Kawamae","doi":"10.1145/2872427.2882981","DOIUrl":null,"url":null,"abstract":"Our proposal, $N$-gram over Context (NOC), is a nonparametric topic model that aims to help our understanding of a given corpus, and be applied to many text mining applications. Like other topic models, NOC represents each document as a mixture of topics and generates each word from one topic. Unlike these models, NOC focuses on both a topic structure as an internal linguistic structure, and N-gram as an external linguistic structure. To improve the quality of topic specific N-grams, NOC reveals a tree of topics that captures the semantic relationship between topics from a given corpus as context, and forms $N$-gram by offering power-law distributions for word frequencies on this topic tree. To gain both these linguistic structures efficiently, NOC learns them from a given corpus in a unified manner. By accessing this entire tree at the word level in the generative process of each document, NOC enables each document to maintain a thematic coherence and form $N$-grams over context. We develop a parallelizable inference algorithm, D-NOC, to support large data sets. Experiments on review articles/papers/tweet show that NOC is useful as a generative model to discover both the topic structure and the corresponding N-grams, and well complements human experts and domain specific knowledge. D-NOC can process large data sets while preserving full generative model performance, by the help of an open-source distributed machine learning framework.","PeriodicalId":20455,"journal":{"name":"Proceedings of the 25th International Conference on World Wide Web","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"8","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 25th International Conference on World Wide Web","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2872427.2882981","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 8
Abstract
Our proposal, $N$-gram over Context (NOC), is a nonparametric topic model that aims to aid understanding of a given corpus and to support many text mining applications. Like other topic models, NOC represents each document as a mixture of topics and generates each word from a single topic. Unlike these models, NOC captures both a topic structure, as an internal linguistic structure, and $N$-grams, as an external linguistic structure. To improve the quality of topic-specific $N$-grams, NOC infers a tree of topics that captures the semantic relationships between the topics of a given corpus as context, and forms $N$-grams by placing power-law distributions over word frequencies on this topic tree. To obtain both linguistic structures efficiently, NOC learns them from the corpus in a unified manner. By accessing the entire tree at the word level in the generative process for each document, NOC enables each document to maintain thematic coherence and to form $N$-grams over context. We develop a parallelizable inference algorithm, D-NOC, to support large data sets. Experiments on review articles, papers, and tweets show that NOC is useful as a generative model for discovering both the topic structure and the corresponding $N$-grams, and that it complements human experts and domain-specific knowledge well. With the help of an open-source distributed machine learning framework, D-NOC processes large data sets while preserving the full generative model's performance.
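The power-law word-frequency behavior the abstract attributes to the topic tree is characteristic of Pitman-Yor-style processes, a common device in nonparametric topic models. The paper's actual construction over the topic tree is not reproduced here; the following is a minimal illustrative sketch of a single Pitman-Yor predictive draw, showing how discounted reuse of previously seen words yields heavy-tailed, power-law-like frequency counts. All identifiers (`pitman_yor_sample`, the parameter values, the integer stand-in vocabulary) are assumptions for illustration, not the paper's API.

```python
import random
from collections import Counter

def pitman_yor_sample(counts, total, discount, concentration, base_draw):
    """One predictive draw from a Pitman-Yor process (illustrative sketch).

    counts: Counter of previously drawn items; total: sum of its values.
    P(new item)      = (concentration + discount * K) / (concentration + total)
    P(existing item) = (count - discount) / (concentration + total)
    where K is the number of distinct items seen so far.
    """
    k = len(counts)
    if total == 0:
        return base_draw()
    p_new = (concentration + discount * k) / (concentration + total)
    if random.random() < p_new:
        return base_draw()
    # Reuse an existing item; the -discount term dampens the rich-get-richer
    # effect just enough to produce power-law frequency distributions.
    r = random.random() * (total - discount * k)
    acc = 0.0
    for item, c in counts.items():
        acc += c - discount
        if r < acc:
            return item
    return item  # guard against floating-point shortfall

if __name__ == "__main__":
    random.seed(0)
    fresh_ids = iter(range(1_000_000))  # stand-in base distribution over a vocabulary
    counts, total = Counter(), 0
    for _ in range(5000):
        w = pitman_yor_sample(counts, total, discount=0.7, concentration=1.0,
                              base_draw=lambda: next(fresh_ids))
        counts[w] += 1
        total += 1
    print("distinct words:", len(counts))
    print("top-5 counts:", [c for _, c in counts.most_common(5)])
```

Running this sketch produces a few very frequent items and a long tail of rare ones, the kind of distribution the abstract describes placing on word frequencies at each node of the topic tree.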