
Latest Publications: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management

Improving Entity Ranking for Keyword Queries
John Foley, Brendan T. O'Connor, J. Allan
Knowledge bases about entities are an important part of modern information retrieval systems. A strong ranking of entities can be used to enhance query understanding and document retrieval or can be presented as another vertical to the user. Given a keyword query, our task is to provide a ranking of the entities present in the collection of interest. We are particularly interested in approaches to this problem that generalize to different knowledge bases and different collections. In the past, this kind of problem has been explored in the enterprise domain through Expert Search. Recently, a dataset was introduced for entity ranking from news and web queries from more general TREC collections. Approaches from prior work leverage a wide variety of lexical resources: e.g., natural language processing and relations in the knowledge base. We address the question of whether we can achieve competitive performance with minimal linguistic resources. We propose a set of features that do not require index-time entity linking, and demonstrate competitive performance on the new dataset. As this paper is the first non-introductory work to leverage this new dataset, we also find and correct certain aspects of the benchmark. To support a fair evaluation, we collect 38% more judgments and contribute annotator agreement information.
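As an illustrative sketch only (the paper's actual feature set is richer; the names and data below are hypothetical), one simple query-time signal that avoids index-time entity linking scores an entity by aggregating the retrieval scores of documents that contain one of its surface forms:

```python
from collections import defaultdict

def rank_entities(query_doc_scores, doc_texts, entity_surface_forms):
    """Score each entity by summing the retrieval scores of the documents
    that mention one of its surface forms; no index-time entity linking
    is required, only a query-time scan of the retrieved documents."""
    scores = defaultdict(float)
    for doc_id, doc_score in query_doc_scores.items():
        text = doc_texts[doc_id].lower()
        for entity, forms in entity_surface_forms.items():
            if any(form.lower() in text for form in forms):
                scores[entity] += doc_score
    # Highest-scoring entities first.
    return sorted(scores.items(), key=lambda kv: -kv[1])

doc_scores = {"d1": 2.0, "d2": 1.0}          # retrieval scores for one query
texts = {"d1": "Barack Obama visited Berlin.", "d2": "Berlin is in Germany."}
forms = {"Barack_Obama": ["Barack Obama"], "Berlin": ["Berlin"]}
ranking = rank_entities(doc_scores, texts, forms)
```

Entities mentioned in several highly ranked documents ("Berlin" here) accumulate more evidence than entities mentioned once.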
DOI: 10.1145/2983323.2983909 · Published: 2016-10-24
Citations: 10
Discovering Entities with Just a Little Help from You
Jaspreet Singh, Johannes Hoffart, Avishek Anand
Linking entities like people, organizations, books, music groups and their songs in text to knowledge bases (KBs) is a fundamental task for many downstream search and mining applications. Achieving high disambiguation accuracy crucially depends on a rich and holistic representation of the entities in the KB. For popular entities, such a representation can be easily mined from Wikipedia, and many current entity disambiguation and linking methods make use of this fact. However, Wikipedia does not contain long-tail entities that only a few people are interested in, and also at times lags behind until newly emerging entities are added. For such entities, mining a suitable representation in a fully automated fashion is very difficult, resulting in poor linking accuracy. What can automatically be mined, though, is a high-quality representation given the context of a new entity occurring in any text. Due to the lack of knowledge about the entity, no method can retrieve these occurrences automatically with high precision, resulting in a chicken-and-egg problem. To address this, our approach automatically generates candidate occurrences of entities, prompting the user for feedback to decide if the occurrence refers to the actual entity in question. This feedback gradually improves the knowledge and allows our methods to provide better candidate suggestions to keep the user engaged. We propose novel human-in-the-loop retrieval methods for generating candidates based on gradient interleaving of diversification and textual relevance approaches. We conducted extensive experiments on the FACC dataset, showing that our approaches convincingly outperform carefully selected baselines in both intrinsic and extrinsic measures while keeping users engaged.
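The paper's gradient interleaving of diversification and relevance is more involved than can be shown here; as a minimal sketch of the human-in-the-loop idea (all names hypothetical), alternate between two candidate rankings, skip duplicates, and keep only occurrences the user confirms:

```python
def interleave(relevance_ranked, diversity_ranked, k):
    """Alternate between a relevance-ordered and a diversity-ordered
    candidate list, skipping duplicates, until k unique candidates remain."""
    a, b = list(relevance_ranked), list(diversity_ranked)
    merged, seen = [], set()
    i = j = 0
    take_rel = True
    while len(merged) < k and (i < len(a) or j < len(b)):
        if (take_rel and i < len(a)) or j >= len(b):
            cand, i = a[i], i + 1
        else:
            cand, j = b[j], j + 1
        take_rel = not take_rel
        if cand not in seen:
            seen.add(cand)
            merged.append(cand)
    return merged

def collect_confirmed(relevance_ranked, diversity_ranked, is_entity, k):
    """Present interleaved candidates and keep those the user confirms;
    `is_entity` stands in for the interactive feedback prompt."""
    return [c for c in interleave(relevance_ranked, diversity_ranked, k)
            if is_entity(c)]

cands = interleave(["e1", "e2", "e3"], ["e2", "e4"], k=4)
```

In a real loop, each round of confirmations would update the entity representation before the next candidate batch is generated.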
DOI: 10.1145/2983323.2983798 · Published: 2016-10-24
Citations: 14
Personalized Semantic Word Vectors
J. Ebrahimi, D. Dou
Distributed word representations are able to capture syntactic and semantic regularities in text. In this paper, we present a word representation scheme that incorporates authorship information. While maintaining similarity among related words in the induced distributed space, our word vectors can be effectively used for some text classification tasks too. We build on a log-bilinear document model (lbDm), which extracts document features, and word vectors based on word co-occurrence counts. First, we propose a log-bilinear author model (lbAm), which contains an additional author matrix. We show that by directly learning author feature vectors, as opposed to document vectors, we can learn better word representations for the authorship attribution task. Furthermore, authorship information has been found to be useful for sentiment classification. We enrich the author model with a sentiment tensor, and demonstrate the effectiveness of this hybrid model (lbHm) through our experiments on a movie review-classification dataset.
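A schematic of the log-bilinear scoring idea with an additional author matrix (shapes, initialization, and names are assumptions, not the trained lbAm): the author vector shifts the predicted representation before the vocabulary is scored.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, A = 6, 4, 2                 # vocab size, embedding dim, author count
R = rng.normal(size=(V, D))       # word feature vectors
Ra = rng.normal(size=(A, D))      # author feature vectors (the extra matrix)
b = np.zeros(V)                   # per-word biases

def predict_word_probs(context_ids, author_id):
    """Average the context word vectors, add the author vector, and score
    every vocabulary word by a dot product followed by a softmax."""
    h = R[context_ids].mean(axis=0) + Ra[author_id]
    logits = R @ h + b
    e = np.exp(logits - logits.max())   # numerically stable softmax
    return e / e.sum()

probs = predict_word_probs([1, 2], author_id=0)
```

Training would fit `R`, `Ra`, and `b` by maximum likelihood; the learned author rows then serve as author features for attribution.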
DOI: 10.1145/2983323.2983875 · Published: 2016-10-24
Citations: 6
Towards Time-Discounted Influence Maximization
Arijit Khan
The classical influence maximization (IM) problem in social networks does not distinguish between whether a campaign gets viral in a week or in a year. From the practical standpoint, however, campaigns for a new technology or an upcoming movie must be spread as quickly as possible, otherwise they will be obsolete. To this end, we formulate and investigate the novel problem of maximizing the time-discounted influence spread in a social network, that is, the campaigner is interested in both "when" and "how likely" a user would be influenced. In particular, we assume that the campaigner has a utility function which monotonically decreases with the time required for a user to get influenced, since the activation of the seed nodes. The problem that we solve in this paper is to maximize the expected aggregated value of this utility function over all network users. This is a novel and relevant problem that, surprisingly, has not been studied before. Time-discounted influence maximization (TDIM), being a generalization of the classical IM, still remains NP-hard. However, our main contribution is to prove the sub-modularity of the objective function for any monotonically decreasing function of time, under a variety of influence cascading models, e.g., the independent cascade, linear threshold, and maximum influence arborescence models, thereby designing approximate algorithms with theoretical performance guarantees. We also illustrate that the existing optimization techniques (e.g., CELF) for influence maximization are more efficient over TDIM. Our experimental results demonstrate the effectiveness of our solutions over several baselines including the classical influence maximization algorithms.
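A toy Monte-Carlo sketch of the objective (not the paper's algorithms; graph, probabilities, and parameters below are made up): record activation times under the independent cascade model, discount each activation by gamma**t, and pick seeds greedily, which for a monotone submodular objective carries the usual (1 - 1/e) approximation guarantee.

```python
import random

def simulate_ic_times(graph, seeds, p=0.5, rng=None):
    """One independent-cascade run; returns {node: activation step}."""
    rng = rng or random
    active = {s: 0 for s in seeds}
    frontier, t = list(seeds), 0
    while frontier:
        t += 1
        nxt = []
        for u in frontier:
            for v in graph.get(u, []):
                if v not in active and rng.random() < p:
                    active[v] = t
                    nxt.append(v)
        frontier = nxt
    return active

def discounted_spread(graph, seeds, gamma=0.8, p=0.5, runs=200, seed=0):
    """Monte-Carlo estimate of the expected sum of gamma**t over activations."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        times = simulate_ic_times(graph, seeds, p, rng)
        total += sum(gamma ** t for t in times.values())
    return total / runs

def greedy_seeds(graph, k, **kw):
    """Greedy selection, near-optimal when the objective is submodular."""
    chosen = []
    nodes = set(graph) | {v for vs in graph.values() for v in vs}
    for _ in range(k):
        best = max((n for n in nodes if n not in chosen),
                   key=lambda n: discounted_spread(graph, chosen + [n], **kw))
        chosen.append(best)
    return chosen
```

Setting gamma to 1 recovers (an estimate of) the classical expected spread, so the discounted objective strictly generalizes IM, as the abstract notes.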
DOI: 10.1145/2983323.2983862 · Published: 2016-10-24
Citations: 9
Studying the Dark Triad of Personality through Twitter Behavior
Daniel Preotiuc-Pietro, J. Carpenter, Salvatore Giorgi, L. Ungar
Research into the darker traits of human nature is growing in interest especially in the context of increased social media usage. This allows users to express themselves to a wider online audience. We study the extent to which the standard model of dark personality -- the dark triad -- consisting of narcissism, psychopathy and Machiavellianism, is related to observable Twitter behavior such as platform usage, posted text and profile image choice. Our results show that we can map various behaviors to psychological theory and study new aspects related to social media usage. Finally, we build a machine learning algorithm that predicts the dark triad of personality in out-of-sample users with reliable accuracy.
DOI: 10.1145/2983323.2983822 · Published: 2016-10-24
Citations: 54
Efficient Distributed Regular Path Queries on RDF Graphs Using Partial Evaluation
Xin Wang, Junhu Wang, Xiaowang Zhang
We propose an efficient distributed method for answering regular path queries (RPQs) on large-scale RDF graphs using partial evaluation. In local computation, we devise a dynamic programming approach to evaluate local and partial answers of an RPQ on each computing site in parallel. In the assembly phase, an automata-based algorithm is proposed to assemble the partial answers of the RPQ into the final results. The experiments on benchmark RDF graphs show that our method outperforms the state-of-the-art message passing methods by up to an order of magnitude.
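Single-site RPQ evaluation itself (without the paper's partial evaluation and automata-based assembly across computing sites) can be sketched as a BFS over the product of the data graph and the query automaton; the tiny DFA below encodes the hypothetical query a b*:

```python
from collections import deque

def rpq(graph, dfa, start_state, accept_states, source):
    """Return the nodes reachable from `source` along a path whose edge-label
    sequence is accepted by the automaton. graph: {node: [(label, node)]},
    dfa: {(state, label): next_state} (a partial transition function)."""
    seen = {(source, start_state)}
    queue = deque(seen)
    answers = set()
    while queue:
        node, state = queue.popleft()
        if state in accept_states:
            answers.add(node)          # reached via an accepted label path
        for label, nxt in graph.get(node, []):
            ns = dfa.get((state, label))
            if ns is not None and (nxt, ns) not in seen:
                seen.add((nxt, ns))
                queue.append((nxt, ns))
    return answers

g = {"x": [("a", "y")], "y": [("b", "z")], "z": [("b", "w")]}
dfa = {(0, "a"): 1, (1, "b"): 1}       # automaton for the RPQ a b*
```

In the distributed setting, each site would run such a traversal over its fragment to produce partial (node, state) results, which the assembly phase then stitches into final answers.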
DOI: 10.1145/2983323.2983877 · Published: 2016-10-24
Citations: 15
Hybrid Indexing for Versioned Document Search with Cluster-based Retrieval
Xin Jin, Daniel Agun, Tao Yang, Qinghao Wu, Yifan Shen, Susen Zhao
The previous two-phase method for searching versioned documents seeks a cost tradeoff by using non-positional information to rank document versions first. The second phase then re-ranks top document versions using positional information with fragment-based index compression. This paper proposes an alternative approach that uses cluster-based retrieval to quickly narrow the search scope guided by version representatives at Phase 1 and develops a hybrid index structure with adaptive runtime data traversal to speed up Phase 2 search. The hybrid scheme exploits the advantages of forward index and inverted index based on the term characteristics to minimize the time in extracting positional and other feature information during runtime search. This paper compares several indexing and data traversal options with different time and space tradeoffs and describes evaluation results to demonstrate their effectiveness. The experimental results show that the proposed scheme can be up to about 4x as fast as the previous work on solid state drives while retaining good relevance.
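A minimal sketch of the two-phase, cluster-narrowed flow (the data layout, scoring functions, and parameters here are hypothetical, and the hybrid forward/inverted index is not modeled): Phase 1 ranks clusters by a cheap score on their representatives, Phase 2 fully scores only the versions inside the surviving clusters.

```python
def two_phase_search(clusters, rep_score, version_score, top_clusters=2, top_k=3):
    """Phase 1: keep the clusters whose representatives score best under a
    cheap scoring function; Phase 2: fully score the versions they contain.
    clusters: {cluster_id: [version_id, ...]}."""
    shortlist = sorted(clusters, key=rep_score, reverse=True)[:top_clusters]
    candidates = [v for c in shortlist for v in clusters[c]]
    return sorted(candidates, key=version_score, reverse=True)[:top_k]
```

Note the tradeoff this makes visible: a strong version inside a weakly represented cluster can be pruned at Phase 1, which is why representative quality matters.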
DOI: 10.1145/2983323.2983733 · Published: 2016-10-24
Citations: 10
Memory-Optimized Distributed Graph Processing through Novel Compression Techniques
Panagiotis Liakos, Katia Papakonstantinopoulou, A. Delis
A multitude of contemporary applications now involve graph data whose size continuously grows, and this trend shows no signs of subsiding. This has caused the emergence of many distributed graph processing systems, including Pregel and Apache Giraph. However, the unprecedented scale now reached by real-world graphs hardens the task of graph processing even in distributed environments, and current memory usage patterns have rapidly become a primary concern for such contemporary graph processing systems. We seek to address this challenge by exploiting empirically observed properties demonstrated by graphs that are generated by human activity. In this paper, we propose three space-efficient adjacency list representations that can be applied to any distributed graph processing system. Our suggested compact representations reduce the memory required to accommodate the graph elements by up to 5 times compared with state-of-the-art methods. At the same time, our memory-optimized methods retain the efficiency of uncompressed structures and enable the execution of algorithms for large-scale graphs in settings where contemporary alternative structures fail due to memory errors.
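The paper's three representations are not reproduced here; as background, a standard space-saving adjacency-list encoding of the kind such compact schemes build on is gap (delta) encoding of sorted neighbor ids followed by variable-length byte packing:

```python
def varint_encode(n):
    """Pack a non-negative int into 7-bit groups, high bit = continuation."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def compress_adjacency(neighbors):
    """Sort neighbor ids, delta-encode the gaps, varint-pack each gap."""
    prev, out = 0, bytearray()
    for v in sorted(neighbors):
        out += varint_encode(v - prev)
        prev = v
    return bytes(out)

def decompress_adjacency(data):
    """Invert compress_adjacency: unpack varints and re-accumulate gaps."""
    vals, acc, shift, cur = [], 0, 0, 0
    for byte in data:
        acc |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:          # last byte of this varint
            cur += acc
            vals.append(cur)
            acc, shift = 0, 0
    return vals
```

Because human-activity graphs have clustered, locality-heavy neighbor lists, most gaps are small and fit in one or two bytes instead of a fixed 4 or 8.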
DOI: 10.1145/2983323.2983687 · Published: 2016-10-24
Citations: 9
Forecasting Geo-sensor Data with Participatory Sensing Based on Dropout Neural Network
Jyun-Yu Jiang, Cheng-te Li
Nowadays, geosensor data, such as air quality and traffic flow, have become more and more essential in people's daily life. However, installing geosensors or hiring volunteers at every location and every time is expensive. Some organizations may have only a few facilities or a limited budget to sense these data. Moreover, people usually tend to want the forecast instead of ongoing observations, but the number of sensors (or volunteers) will be a hurdle to making precise predictions. In this paper, we propose a novel concept to forecast geosensor data with participatory sensing. Given a limited number of sensors or volunteers, participatory sensing assumes each of them can observe and collect data at different locations and at different times. By aggregating these sparse past observations, we propose a neural network based approach to forecast the future geosensor data in any location of an urban area. Extensive experiments have been conducted with large-scale datasets of the air quality in three cities and the traffic of bike sharing systems in two cities. Experimental results show that our predictive model can precisely forecast the air quality and the bike rental traffic as geosensor data.
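As a minimal sketch of a dropout network's forward pass (architecture, shapes, and weights are assumptions, not the paper's model): hidden units are randomly zeroed during training, with inverted scaling so the expected activation is unchanged, while inference uses the full network.

```python
import numpy as np

def dropout_forward(x, W1, b1, W2, b2, drop_prob=0.5, train=True, rng=None):
    """One-hidden-layer regression net with (inverted) dropout on the
    hidden units; at inference the full network is used unchanged."""
    h = np.maximum(0.0, x @ W1 + b1)            # ReLU hidden layer
    if train:
        rng = rng or np.random.default_rng()
        mask = rng.random(h.shape) >= drop_prob
        h = h * mask / (1.0 - drop_prob)        # keeps E[h] fixed
    return h @ W2 + b2

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))                     # 3 samples, 4 input features
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)
y = dropout_forward(x, W1, b1, W2, b2, train=False)   # deterministic inference
```

Dropout acts as a regularizer, which is useful when the training signal is as sparse as participatory-sensing observations.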
DOI: 10.1145/2983323.2983902 · Published: 2016-10-24
Citations: 4
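The paper's title names a dropout neural network as the forecasting model for sparse participatory observations. As a rough illustration of that idea — not the authors' architecture — here is a minimal NumPy sketch of an MLP regressor with Monte-Carlo dropout, where averaging several stochastic forward passes yields a point forecast plus a spread; the feature layout (e.g., location coordinates, hour, aggregated nearby readings) and all names are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, rate, train=True):
    """Inverted dropout: zero units with probability `rate`, rescale the rest."""
    if not train or rate == 0.0:
        return x
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

class DropoutMLP:
    """Tiny MLP regressor: sparse spatio-temporal features -> one sensor reading."""

    def __init__(self, n_in, n_hidden, rate=0.3):
        self.W1 = rng.normal(0.0, 0.1, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.1, (n_hidden, 1))
        self.b2 = np.zeros(1)
        self.rate = rate

    def forward(self, x, train=True):
        h = np.tanh(x @ self.W1 + self.b1)
        h = dropout(h, self.rate, train=train)
        return (h @ self.W2 + self.b2).ravel()

    def mc_predict(self, x, n_samples=50):
        """Monte-Carlo dropout: average stochastic forward passes; the
        standard deviation gives a crude uncertainty for the forecast."""
        preds = np.stack([self.forward(x, train=True) for _ in range(n_samples)])
        return preds.mean(axis=0), preds.std(axis=0)
```

With `train=False` the network is deterministic (ordinary inference); keeping dropout active at prediction time and averaging, as in `mc_predict`, is one standard way a dropout network is used for forecasting under sparse data.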
Generalizing Translation Models in the Probabilistic Relevance Framework
Navid Rekabsaz, M. Lupu, A. Hanbury, G. Zuccon
A recurring question in information retrieval is whether term associations can be properly integrated in traditional information retrieval models while preserving their robustness and effectiveness. In this paper, we revisit a wide spectrum of existing models (Pivoted Document Normalization, BM25, BM25 Verboseness Aware, Multi-Aspect TF, and Language Modelling) by introducing a generalisation of the idea of the translation model. This generalisation is a de facto transformation of the translation models from Language Modelling to the probabilistic models. In doing so, we observe a potential limitation of these generalised translation models: they only affect the term frequency based components of all the models, ignoring changes in document and collection statistics. We correct this limitation by extending the translation models with the 15 statistics of term associations and provide extensive experimental results to demonstrate the benefit of the newly proposed methods. Additionally, we compare the translation models with query expansion methods based on the same term association resources, as well as based on Pseudo-Relevance Feedback (PRF). We observe that translation models always outperform the first, but provide complementary information with the second, such that by using PRF and our translation models together we observe results better than the current state of the art.
{"title":"Generalizing Translation Models in the Probabilistic Relevance Framework","authors":"Navid Rekabsaz, M. Lupu, A. Hanbury, G. Zuccon","doi":"10.1145/2983323.2983833","DOIUrl":"https://doi.org/10.1145/2983323.2983833","abstract":"A recurring question in information retrieval is whether term associations can be properly integrated in traditional information retrieval models while preserving their robustness and effectiveness. In this paper, we revisit a wide spectrum of existing models (Pivoted Document Normalization, BM25, BM25 Verboseness Aware, Multi-Aspect TF, and Language Modelling) by introducing a generalisation of the idea of the translation model. This generalisation is a de facto transformation of the translation models from Language Modelling to the probabilistic models. In doing so, we observe a potential limitation of these generalised translation models: they only affect the term frequency based components of all the models, ignoring changes in document and collection statistics. We correct this limitation by extending the translation models with the statistics of term associations and provide extensive experimental results to demonstrate the benefit of the newly proposed methods. Additionally, we compare the translation models with query expansion methods based on the same term association resources, as well as based on Pseudo-Relevance Feedback (PRF). We observe that translation models always outperform the first, but provide complementary information with the second, such that by using PRF and our translation models together we observe results better than the current state of the art.","PeriodicalId":250808,"journal":{"name":"Proceedings of the 25th ACM International on Conference on Information and Knowledge Management","volume":"117 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122091043","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 25
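To make the generalisation concrete, here is a hedged Python sketch of one way a translation model can be folded into a probabilistic model such as BM25: the raw term frequency of a query term is replaced by a translation-weighted frequency over its associated terms, and the document frequency is likewise computed over the associated terms, so collection statistics change along with the term-frequency component (the limitation the abstract points out). This is an illustrative reading with assumed names, not the paper's exact formulation:

```python
import math
from collections import Counter

def bm25_translation(query, doc, docs, trans, k1=1.2, b=0.75):
    """Score `doc` for `query` with BM25, generalising tf via a translation
    model: tf'(t, d) = sum over t' of P(t | t') * tf(t', d). Document
    frequency is also taken over the associated terms."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    tf = Counter(doc)
    score = 0.0
    for t in query:
        assoc = trans.get(t, {t: 1.0})  # fall back to the identity translation
        # translation-weighted ("generalised") term frequency
        gtf = sum(p * tf.get(t2, 0) for t2, p in assoc.items())
        # document frequency over the whole association set, not just t
        df = sum(1 for d in docs if any(t2 in d for t2 in assoc))
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        score += idf * gtf * (k1 + 1) / (gtf + k1 * (1 - b + b * len(doc) / avgdl))
    return score
```

With `trans = {"cat": {"cat": 0.7, "feline": 0.3}}`, a document containing only "feline" receives a non-zero score for the query `["cat"]`, which the same scorer with an empty `trans` (plain BM25) would miss.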