
Proceedings of the 21st ACM international conference on Information and knowledge management: Latest Publications

Modeling topic hierarchies with the recursive Chinese restaurant process
Joonyeob Kim, Dongwoo Kim, Suin Kim, Alice H. Oh
Topic models such as latent Dirichlet allocation (LDA) and hierarchical Dirichlet processes (HDP) are simple solutions to discover topics from a set of unannotated documents. While they are simple and popular, a major shortcoming of LDA and HDP is that they do not organize the topics into a hierarchical structure, which is naturally found in many datasets. We introduce the recursive Chinese restaurant process (rCRP) and a nonparametric topic model with rCRP as a prior for discovering a hierarchical topic structure with unbounded depth and width. Unlike previous models for discovering topic hierarchies, rCRP allows the documents to be generated from a mixture over the entire set of topics in the hierarchy. We apply rCRP to a corpus of New York Times articles, a dataset of MovieLens ratings, and a set of Wikipedia articles and show the discovered topic hierarchies. We compare the predictive power of rCRP with LDA, HDP, and the nested Chinese restaurant process (nCRP) using held-out likelihood to show that rCRP outperforms the others. We suggest two metrics that quantify the characteristics of a topic hierarchy to compare the discovered topic hierarchies of rCRP and nCRP. The results show that rCRP discovers a hierarchy in which the topics become more specialized toward the leaves, and topics in the immediate family exhibit more affinity than topics beyond the immediate family.
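Since the abstract only names the generative mechanism, the following is a minimal sketch of how a recursive-CRP-style draw over an unbounded tree might look; the Node class, the stop mass, and the weighting scheme are simplifying assumptions, not the paper's exact rCRP definition or its inference procedure. What it illustrates is the key property stated above: a token can land on any node of the hierarchy, internal or leaf, so documents mix over the entire tree.

```python
import random

class Node:
    """A topic node in an unbounded hierarchy (simplified, hypothetical)."""
    def __init__(self, depth=0):
        self.count = 0        # tokens assigned directly to this topic
        self.children = []    # child topics created so far
        self.depth = depth

    def subtree_count(self):
        return self.count + sum(c.subtree_count() for c in self.children)

def draw_topic(node, gamma=1.0, stop=1.0):
    """Pick a topic node recursively: stay here (weight: local count plus a
    stop mass), descend into an existing child (weight: its subtree size),
    or open a new child (weight: gamma). Any node, internal or leaf, can be
    chosen, so documents mix over the whole hierarchy."""
    weights = [node.count + stop]
    weights += [c.subtree_count() for c in node.children]
    weights.append(gamma)
    r = random.uniform(0.0, sum(weights))
    if r < weights[0]:
        node.count += 1
        return node
    acc = weights[0]
    for child, w in zip(node.children, weights[1:-1]):
        acc += w
        if r < acc:
            return draw_topic(child, gamma, stop)
    child = Node(node.depth + 1)   # open a brand-new subtree
    node.children.append(child)
    return draw_topic(child, gamma, stop)

root = Node()
assignments = [draw_topic(root) for _ in range(1000)]
print(len(root.children), "top-level topics discovered")
```

Repeated draws grow the tree in both depth and width, matching the unbounded structure the abstract describes.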
Citations: 53
Is Wikipedia too difficult? Comparative analysis of readability of Wikipedia, Simple Wikipedia and Britannica
A. Jatowt, Katsumi Tanaka
Readability is one of the key factors determining document quality and reader satisfaction. In this paper we analyze the readability of Wikipedia, a popular source of information for searchers exploring unknown topics. Although Wikipedia articles are frequently listed by search engines at top ranks, they are often too difficult for average readers searching for information about difficult queries. We examine the average readability of content in Wikipedia and compare it to that of Simple Wikipedia and Britannica. Next, we investigate the readability of selected categories in Wikipedia. Apart from standard readability measures, we use some new metrics based on word popularity and the distribution of words across different document genres and topics.
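As one concrete instance of the "standard readability measures" mentioned above, here is a small sketch of the Flesch Reading Ease score; the vowel-group syllable counter is a naive assumption (real analyses would use a pronunciation dictionary), and the paper does not say which specific formulas it applied.

```python
import re

def count_syllables(word):
    """Naive heuristic: count vowel groups; real systems would use a
    pronouncing dictionary such as CMUdict."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """Standard Flesch Reading Ease:
    206.835 - 1.015*(words/sentences) - 84.6*(syllables/words).
    Higher scores mean easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835 - 1.015 * len(words) / len(sentences)
            - 84.6 * syllables / len(words))

print(flesch_reading_ease(
    "Wikipedia articles can be hard to read. "
    "Simple Wikipedia aims for shorter, plainer sentences."))
```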
Citations: 13
Shard ranking and cutoff estimation for topically partitioned collections
Anagha Kulkarni, Almer S. Tigelaar, D. Hiemstra, Jamie Callan
Large document collections can be partitioned into 'topical shards' to facilitate distributed search. In a low-resource search environment, only a few of the shards can be searched in parallel. Such a search environment faces two intertwined challenges. First, determining which shards to consult for a given query: shard ranking. Second, deciding how many shards to consult from the ranking: cutoff estimation. In this paper we present a family of three algorithms that address both of these problems. As a basis we employ a commonly used data structure, the central sample index (CSI), to represent the shard contents. Running a query against the CSI yields a flat document ranking that each of our algorithms transforms into a tree structure. A bottom-up traversal of the tree is used to infer a ranking of shards and also to estimate a stopping point in this ranking that yields cost-effective selective distributed search. Compared to a state-of-the-art shard ranking approach, the proposed algorithms provide substantially higher search efficiency while delivering comparable search effectiveness.
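The abstract names the algorithms but not their internals, so the sketch below substitutes a simple rank-discounted voting scheme (in the spirit of CSI-based methods such as ReDDE) for the paper's tree-based ranking, plus a coverage-based cutoff heuristic; both the 1/rank discount and the coverage threshold are assumptions for illustration.

```python
from collections import defaultdict

def rank_shards(csi_ranking, top_k=100):
    """Each of the top-k CSI documents votes for its shard with a
    rank-discounted weight (assumed 1/rank discount)."""
    scores = defaultdict(float)
    for rank, (doc_id, shard_id) in enumerate(csi_ranking[:top_k], start=1):
        scores[shard_id] += 1.0 / rank
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

def estimate_cutoff(ranked_shards, coverage=0.8):
    """Search just enough top shards to cover a fixed fraction of the total
    score mass (a simple heuristic standing in for the paper's estimator)."""
    total = sum(score for _, score in ranked_shards)
    acc = 0.0
    for k, (_, score) in enumerate(ranked_shards, start=1):
        acc += score
        if acc >= coverage * total:
            return k
    return len(ranked_shards)

csi = [("d3", "shardA"), ("d7", "shardB"), ("d1", "shardA"), ("d9", "shardC")]
ranked = rank_shards(csi)
print(ranked, "-> search top", estimate_cutoff(ranked), "shards")
```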
Citations: 46
LINDA: distributed web-of-data-scale entity matching
Christoph Böhm, Gerard de Melo, Felix Naumann, G. Weikum
Linked Data has emerged as a powerful way of interconnecting structured data on the Web. However, the cross-linkage between Linked Data sources is not as extensive as one would hope for. In this paper, we formalize the task of automatically creating "sameAs" links across data sources in a globally consistent manner. Our algorithm, presented in a multi-core as well as a distributed version, achieves this link generation by accounting for joint evidence of a match. Experiments confirm that our system scales beyond 100 million entities and delivers highly accurate results despite the vast heterogeneity and daunting scale.
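As a toy illustration of "joint evidence of a match", the sketch below lets each candidate sameAs pair's score rise when its neighboring pairs have also been accepted; the blending formula, alpha, threshold, and data layout are all invented for illustration and are not LINDA's actual objective or its distributed algorithm.

```python
def match_entities(pairs, neighbors, base_sim,
                   threshold=0.6, rounds=3, alpha=0.3):
    """Toy joint-evidence matcher: a pair's score blends its own attribute
    similarity with the fraction of neighboring pairs already accepted as
    sameAs; pairs above the threshold are (re-)accepted each round."""
    accepted = set()
    scores = dict(base_sim)
    for _ in range(rounds):
        for a, b in pairs:
            neigh = [(x, y) for x in neighbors.get(a, ())
                            for y in neighbors.get(b, ())]
            support = sum(p in accepted for p in neigh)
            bonus = support / len(neigh) if neigh else 0.0
            scores[(a, b)] = (1 - alpha) * base_sim[(a, b)] + alpha * bonus
        accepted = {p for p in pairs if scores[p] >= threshold}
    return accepted

# Hypothetical example: accepting Berlin=berlin pulls Germany=germany in too.
pairs = [("db:Berlin", "fb:berlin"), ("db:Germany", "fb:germany")]
neighbors = {"db:Berlin": ["db:Germany"], "fb:berlin": ["fb:germany"],
             "db:Germany": ["db:Berlin"], "fb:germany": ["fb:berlin"]}
base_sim = {("db:Berlin", "fb:berlin"): 0.9,
            ("db:Germany", "fb:germany"): 0.75}
print(match_entities(pairs, neighbors, base_sim))
```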
Citations: 94
One seed to find them all: mining opinion features via association
Zhen Hai, Kuiyu Chang, G. Cong
Feature-based opinion analysis has attracted extensive attention recently. Identifying the features associated with opinions expressed in reviews is essential for fine-grained opinion mining. One approach is to exploit the dependency relations that occur naturally between features and opinion words, and among features (or opinion words) themselves. In this paper, we propose a generalized approach to opinion feature extraction by incorporating robust statistical association analysis in a bootstrapping framework. The new approach starts with a small set of feature seeds, which it iteratively enlarges by mining feature-opinion, feature-feature, and opinion-opinion dependency relations. Two association model types, namely likelihood ratio tests (LRT) and latent semantic analysis (LSA), are proposed for computing the pair-wise associations between terms (features or opinions). We accordingly propose two robust bootstrapping approaches, LRTBOOT and LSABOOT, both of which need just a handful of initial feature seeds to bootstrap opinion feature extraction. We benchmarked LRTBOOT and LSABOOT against existing approaches on a large number of real-life reviews crawled from the cellphone and hotel domains. Experimental results using varying numbers of feature seeds show that the proposed association-based bootstrapping approach significantly outperforms the competitors. In fact, one seed feature is all that is needed for LRTBOOT to significantly outperform the other methods. This seed feature can simply be the domain feature, e.g., "cellphone" or "hotel". The consequence of our discovery is far-reaching: starting with just one feature seed, typically just the domain concept word, LRTBOOT can automatically extract a large set of high-quality opinion features from the corpus without any supervision or labeled features. This means that the automatic creation of a set of domain features is no longer a pipe dream!
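The LRT association can be made concrete with Dunning's log-likelihood ratio over a 2x2 co-occurrence table, the standard formulation of likelihood ratio tests for term association. The one-round seed expansion below is a simplification (the full method iterates and also mines feature-feature and opinion-opinion relations), and all counts in the usage example are made up.

```python
import math

def llr(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio (G^2) for a 2x2 contingency table:
    k11 reviews contain both terms, k12 only the first, k21 only the
    second, k22 neither. Larger values mean stronger association."""
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    obs = ((k11, k12), (k21, k22))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = obs[i][j]
            if o > 0:   # observed vs. expected count rows[i]*cols[j]/n
                g2 += o * math.log(o * n / (rows[i] * cols[j]))
    return 2.0 * g2

def expand_seeds(seeds, candidates, cooc, occ, n_docs, top_m=2):
    """One bootstrap round: score each candidate by its best LLR association
    with any current seed and admit the top m."""
    scored = []
    for c in candidates - seeds:
        best = max(llr(cooc[(s, c)],
                       occ[s] - cooc[(s, c)],
                       occ[c] - cooc[(s, c)],
                       n_docs - occ[s] - occ[c] + cooc[(s, c)])
                   for s in seeds)
        scored.append((best, c))
    return seeds | {c for _, c in sorted(scored, reverse=True)[:top_m]}

# Made-up counts over a hypothetical 1,000-review cellphone corpus.
occ = {"cellphone": 500, "battery": 120, "screen": 90, "great": 300}
cooc = {("cellphone", "battery"): 80, ("cellphone", "screen"): 60,
        ("cellphone", "great"): 70}
print(expand_seeds({"cellphone"}, {"battery", "screen", "great"},
                   cooc, occ, n_docs=1000))
```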
Citations: 71
Estimating query difficulty for news prediction retrieval
Nattiya Kanhabua, K. Nørvåg
News prediction retrieval has recently emerged as the task of retrieving predictions related to a given news story (or a query). Predictions are defined as sentences containing time references to future events. Such future-related information is crucially important for understanding the temporal development of news stories, as well as for strategy planning and risk management. Prior work has been shown to retrieve a significant number of relevant predictions. However, only certain news topics achieve good retrieval effectiveness. In this paper, we study how to determine the difficulty of retrieving predictions for a given news story. More precisely, we address the query difficulty estimation problem for news prediction retrieval. We propose different entity-based predictors used for classifying queries into two classes, namely, Easy and Difficult. Our prediction model is based on a machine learning approach. Through experiments on real-world data, we show that our approach can predict query difficulty with high accuracy.
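A minimal sketch of the classification setup: a binary classifier over entity-based query features. The specific features and training values below are hypothetical placeholders (the abstract does not enumerate the predictors), and logistic regression merely stands in for whatever learner the paper uses.

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical entity-based features per query:
# [# named entities, entity-term fraction, mean entity IDF, query length].
X_train = [
    [3, 0.60, 7.1, 5],
    [0, 0.00, 0.0, 2],
    [2, 0.50, 6.4, 4],
    [1, 0.20, 2.3, 6],
]
y_train = [1, 0, 1, 0]  # 1 = Easy, 0 = Difficult (made-up labels)

clf = LogisticRegression().fit(X_train, y_train)
print("Easy" if clf.predict([[2, 0.40, 5.0, 4]])[0] else "Difficult")
```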
Citations: 0
Clustering Wikipedia infoboxes to discover their types
T. Nguyen, Huong Nguyen, V. Moreira, J. Freire
Wikipedia has emerged as an important source of structured information on the Web. But while the success of Wikipedia can be attributed in part to the simplicity of adding and modifying content, this has also created challenges when it comes to using, querying, and integrating the information. Even though authors are encouraged to select appropriate categories and provide infoboxes that follow pre-defined templates, many do not follow the guidelines or follow them loosely. This leads to undesirable effects, such as template duplication, heterogeneity, and schema drift. As a step towards addressing this problem, we propose a new unsupervised approach for clustering Wikipedia infoboxes. Instead of relying on manually assigned categories and template labels, we use the structured information available in infoboxes to group them and infer their entity types. Experiments using over 48,000 infoboxes indicate that our clustering approach is effective and produces high quality clusters.
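A rough sketch of clustering infoboxes by their structured content: here, similarity is Jaccard overlap of attribute names with a greedy assignment rule. Both choices are assumptions made for illustration; the abstract does not specify the paper's similarity measure or clustering algorithm.

```python
def jaccard(a, b):
    return len(a & b) / len(a | b)

def cluster_infoboxes(infoboxes, threshold=0.5):
    """Greedy clustering of infoboxes by attribute-name overlap: assign each
    infobox to the first cluster whose representative schema is similar
    enough, otherwise start a new cluster."""
    clusters = []  # list of (representative attribute set, member names)
    for name, attrs in infoboxes.items():
        for rep, members in clusters:
            if jaccard(attrs, rep) >= threshold:
                members.append(name)
                rep |= attrs   # grow the representative schema in place
                break
        else:
            clusters.append((set(attrs), [name]))
    return [members for _, members in clusters]

boxes = {
    "Infobox_person_a": {"name", "birth_date", "occupation"},
    "Infobox_person_b": {"name", "birth_date", "nationality"},
    "Infobox_city":     {"name", "population", "area", "country"},
}
print(cluster_infoboxes(boxes))  # person boxes group together; city is apart
```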
Citations: 1
Diversionary comments under political blog posts
Jing Wang, Clement T. Yu, Philip S. Yu, B. Liu, W. Meng
An important issue that has been neglected so far is the identification of diversionary comments. Diversionary comments under political blog posts are defined as comments that deliberately twist the blogger's intention and divert the topic to another one. The purpose is to distract readers from the original topic and draw attention to a new one. Given that political blogs have a significant impact on society, we believe it is imperative to identify such comments. We categorize diversionary comments into five types, and propose an effective technique to rank comments in descending order of how diversionary they are. To the best of our knowledge, the problem of detecting diversionary comments has not been studied so far. Our evaluation on 2,109 comments under 20 different blog posts from Digg.com shows that the proposed method achieves a high mean average precision (MAP) of 92.6%. Sensitivity analysis indicates that the effectiveness of the method is stable under different parameter settings.
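The sketch below shows the ranking setup with one deliberately weak signal: comments least similar to the post (by TF-IDF cosine) come first. The actual technique draws on richer evidence, such as the five comment types, so treat this only as a scaffold for the "most-diversionary-first" ordering.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_diversionary(post, comments):
    """Rank comments most-diversionary first, using low TF-IDF cosine
    similarity to the post as a crude stand-in signal."""
    vec = TfidfVectorizer()
    m = vec.fit_transform([post] + comments)
    sims = cosine_similarity(m[0], m[1:]).ravel()
    order = sims.argsort()  # ascending similarity = descending diversion
    return [(comments[i], float(sims[i])) for i in order]

post = "The senate debated the new healthcare bill today."
comments = ["The bill's coverage rules are the real issue.",
            "Forget healthcare, look at this stock tip!"]
for comment, sim in rank_diversionary(post, comments):
    print(round(sim, 3), comment)
```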
Citations: 14
Data filtering in humor generation: comparative analysis of hit rate and co-occurrence rankings as a method to choose usable pun candidates
Pawel Dybala, Rafal Rzepka, K. Araki, Kohichi Sayama
In this paper we propose a method of filtering the excessive amounts of textual data acquired from the Internet. In our research on pun generation in Japanese, we experienced problems with extensively long data processing times, caused by the number of phonetic candidates (i.e., phrases that can be used to generate actual puns) generated by our system. A simple, naive approach, in which we consider only the phrases with the highest occurrence on the Internet, can result in the deletion of candidates that are actually usable. Thus, we propose a data filtering method in which we compare two Internet-based rankings, a co-occurrence ranking and a hit rate ranking, and select only candidates that occupy the same or similar positions in both rankings. In this work we analyze the effects of such data reduction, considering five cases: when the candidates occupy exactly the same positions in both rankings, and when their positions differ by 1, 2, 3, and 4. The analysis is conducted on data acquired by comparing pun candidates generated by the system (and filtered with our method) with phrases that were actually used in puns created by humans. The results show that the proposed method can be used to filter excessive amounts of textual data acquired from the Internet.
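The selection rule itself is concrete enough to sketch directly: keep a candidate only if its positions in the hit-rate and co-occurrence rankings differ by at most d, where d = 0 reproduces the same-position case and d = 1..4 give the relaxed cases analyzed above. The example candidates are invented.

```python
def filter_candidates(hit_rate_ranking, cooccurrence_ranking, max_diff=0):
    """Keep pun candidates whose positions in the two rankings differ by at
    most max_diff; candidates missing from either ranking are dropped."""
    pos_hit = {c: i for i, c in enumerate(hit_rate_ranking)}
    pos_co = {c: i for i, c in enumerate(cooccurrence_ranking)}
    return [c for c in hit_rate_ranking
            if c in pos_co and abs(pos_hit[c] - pos_co[c]) <= max_diff]

hits = ["neko", "saru", "kame", "tori"]
cooc = ["saru", "neko", "kame", "inu"]
print(filter_candidates(hits, cooc, max_diff=1))  # ['neko', 'saru', 'kame']
```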
Citations: 1
SonetRank: leveraging social networks to personalize search
Abhijith Kashyap, R. Amini, Vagelis Hristidis
Earlier works on personalized Web search focused on click-through graphs, while recent works leverage social annotations, which are often unavailable. On the other hand, many users are members of social networks and subscribe to social groups. Intuitively, users in the same group may have similar relevance judgments for queries related to these groups. SonetRank utilizes this observation to personalize Web search results based on the aggregate relevance feedback of the users in similar groups. SonetRank builds and maintains a rich graph-based model, termed the Social Aware Search Graph, consisting of groups, users, queries, and results click-through information. SonetRank's personalization scheme learns in a principled way to leverage the following three signals, in decreasing order of strength: the document preferences of the user herself, of the users in her social groups relevant to the query, and of the other users in the network. SonetRank also uses a novel approach to measure the amount of personalization with respect to a user and a query, based on the query-specific richness of the user's social profile. We evaluate SonetRank with users on Amazon Mechanical Turk and show a significant improvement in ranking compared to state-of-the-art techniques.
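A minimal sketch of combining the three signals as a weighted re-ranking; the fixed weights and score layout are assumptions (SonetRank learns how much to personalize per user and query, which this sketch does not attempt).

```python
def personalize(results, user_pref, group_pref, network_pref,
                w_user=0.5, w_group=0.3, w_network=0.2):
    """Re-rank (doc_id, base_score) results by a weighted sum of three
    click-feedback signals with decreasing weights: the user's own
    preferences, her query-relevant groups', and the wider network's."""
    def score(doc_id, base):
        return (base
                + w_user * user_pref.get(doc_id, 0.0)
                + w_group * group_pref.get(doc_id, 0.0)
                + w_network * network_pref.get(doc_id, 0.0))
    return sorted(results, key=lambda r: score(r[0], r[1]), reverse=True)

results = [("doc1", 0.9), ("doc2", 0.8), ("doc3", 0.7)]
print(personalize(results,
                  user_pref={"doc3": 0.8},
                  group_pref={"doc2": 0.5},
                  network_pref={"doc1": 0.2}))  # doc3 rises to the top
```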
Citations: 18