Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval: Latest Papers

Analysis of the Paragraph Vector Model for Information Retrieval
Qingyao Ai, Liu Yang, Jiafeng Guo, W. Bruce Croft
Previous studies have shown that semantically meaningful representations of words and text can be acquired through neural embedding models. In particular, paragraph vector (PV) models have shown impressive performance in some natural language processing tasks by estimating a document (topic) level language model. Integrating the PV models with traditional language model approaches to retrieval, however, produces unstable performance and limited improvements. In this paper, we formally discuss three intrinsic problems of the original PV model that restrict its performance in retrieval tasks. We also describe modifications to the model that make it more suitable for the IR task, and show their impact through experiments and case studies. The three issues we address are (1) the unregulated training process of PV is vulnerable to short document over-fitting that produces length bias in the final retrieval model; (2) the corpus-based negative sampling of PV leads to a weighting scheme for words that overly suppresses the importance of frequent words; and (3) the lack of word-context information makes PV unable to capture word substitution relationships.
DOI: https://doi.org/10.1145/2970398.2970409 (published 2016-09-12)
Citations: 79
Embedding-based Query Language Models
Hamed Zamani, W. Bruce Croft
Word embeddings, which are low-dimensional vector representations of vocabulary terms that capture the semantic similarity between them, have recently been shown to achieve impressive performance in many natural language processing tasks. The use of word embeddings in information retrieval, however, has only begun to be studied. In this paper, we explore the use of word embeddings to enhance the accuracy of query language models in the ad-hoc retrieval task. To this end, we propose to use word embeddings to incorporate and weight terms that do not occur in the query but are semantically related to the query terms. We describe two embedding-based query expansion models with different assumptions. Since pseudo-relevance feedback methods that use the top retrieved documents to update the original query model are well known to be effective, we also develop an embedding-based relevance model, an extension of the effective and robust relevance model approach. In these models, we transform the similarity values obtained by the widely used cosine similarity with a sigmoid function to obtain more discriminative semantic similarity values. We evaluate our proposed methods using three TREC newswire and web collections. The experimental results demonstrate that the embedding-based methods significantly outperform competitive baselines in most cases. The embedding-based methods are also shown to be more robust than the baselines.
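The sigmoid transformation of cosine similarity mentioned in this abstract can be illustrated with a minimal sketch. The function names, the toy vectors, and the steepness and midpoint values (a, c) are assumptions chosen for illustration; the abstract does not give the paper's parameterization or its expansion models.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def sigmoid_similarity(u, v, a=10.0, c=0.8):
    """Pass cosine similarity through a sigmoid.

    a (steepness) and c (midpoint) are illustrative values,
    not parameters reported in the abstract.
    """
    return 1.0 / (1.0 + math.exp(-a * (cosine(u, v) - c)))

def expansion_weights(query_vec, vocab_vecs, a=10.0, c=0.8):
    """Normalize transformed similarities into weights over candidate terms."""
    scores = {t: sigmoid_similarity(query_vec, v, a, c) for t, v in vocab_vecs.items()}
    total = sum(scores.values())
    return {t: s / total for t, s in scores.items()}

# Hypothetical 2-d embeddings: one term close to the query, one far from it.
toy_vocab = {"near": [0.9, 0.1], "far": [0.1, 0.9]}
weights = expansion_weights([1.0, 0.0], toy_vocab)
```

With these toy values the raw cosine scores differ by roughly a factor of nine, while the transformed scores differ by several hundred, which is the sense in which the transformed values separate related from unrelated terms more sharply.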
DOI: https://doi.org/10.1145/2970398.2970405 (published 2016-09-12)
Citations: 130
Estimating Embedding Vectors for Queries
Hamed Zamani, W. Bruce Croft
Dense vector representations of vocabulary terms, also known as word embeddings, have been shown to be highly effective in many natural language processing tasks. Word embeddings have recently begun to be studied in a number of information retrieval (IR) tasks. One of the main steps in leveraging word embeddings for IR tasks is to estimate the embedding vectors of queries. This is a challenging task, since queries are not always available during the training phase of word embedding vectors. Previous work has considered the average or sum of embedding vectors of all query terms (AWE) to model the query embedding vectors, but no theoretical justification has been presented for such a model. In this paper, we propose a theoretical framework for estimating query embedding vectors based on the individual embedding vectors of vocabulary terms. We then provide a number of different implementations of this framework and show that the AWE method is a special case of the proposed framework. We also introduce pseudo query vectors, the query embedding vectors estimated using pseudo-relevant documents. We further extrinsically evaluate the proposed methods using two well-known IR tasks: query expansion and query classification. The estimated query embedding vectors are evaluated via query expansion experiments over three newswire and web TREC collections as well as query classification experiments over the KDD Cup 2005 test set. The experiments show that the introduced pseudo query vectors significantly outperform the AWE method.
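The AWE baseline described in this abstract (averaging the embedding vectors of the query terms) can be sketched as follows. The function name, the toy embeddings, and the choice to skip out-of-vocabulary terms are assumptions made for illustration; the paper's general framework and pseudo query vectors are not reproduced here.

```python
def average_word_embedding(query_terms, embeddings):
    """Average the vectors of the query terms that have an embedding.

    Skipping out-of-vocabulary terms is an assumption of this sketch.
    Returns None when no query term has an embedding.
    """
    vectors = [embeddings[t] for t in query_terms if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

# Hypothetical 3-d embeddings; "unseen" has no vector and is ignored.
toy_embeddings = {"neural": [1.0, 0.0, 2.0], "ranking": [3.0, 2.0, 0.0]}
query_vector = average_word_embedding(["neural", "ranking", "unseen"], toy_embeddings)
# query_vector == [2.0, 1.0, 1.0]
```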
DOI: https://doi.org/10.1145/2970398.2970403 (published 2016-09-12)
Citations: 104
Bag-of-Entities Representation for Ranking
Chenyan Xiong, Jamie Callan, Tie-Yan Liu
This paper presents a new bag-of-entities representation for document ranking, with the help of modern knowledge bases and automatic entity linking. Our system represents queries and documents by bag-of-entities vectors constructed from their entity annotations, and ranks documents by their matches with the query in the entity space. Our experiments with Freebase on TREC Web Track datasets demonstrate that current entity linking systems can provide sufficient coverage of the general domain search task, and that bag-of-entities representations outperform bag-of-words by as much as 18% in standard document ranking tasks.
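Ranking documents by their matches with the query in entity space can be sketched as below. The entity identifiers, the two scoring functions, and the tie-breaking order are illustrative assumptions; the abstract does not specify how matches are scored, and no entity linker is invoked here (annotations are given by hand).

```python
from collections import Counter

def bag_of_entities(entity_annotations):
    """Turn a list of linked entity IDs into an entity-count vector."""
    return Counter(entity_annotations)

def coordinate_match(query_bag, doc_bag):
    """Number of distinct query entities that occur in the document."""
    return sum(1 for entity in query_bag if doc_bag[entity] > 0)

def entity_frequency(query_bag, doc_bag):
    """Total occurrences of query entities in the document."""
    return sum(doc_bag[entity] for entity in query_bag)

def rank(query_annotations, docs):
    """Rank by distinct matches, then by match frequency, then by doc ID."""
    query_bag = bag_of_entities(query_annotations)
    scored = []
    for doc_id, annotations in docs.items():
        doc_bag = bag_of_entities(annotations)
        scored.append((coordinate_match(query_bag, doc_bag),
                       entity_frequency(query_bag, doc_bag),
                       doc_id))
    scored.sort(key=lambda s: (-s[0], -s[1], s[2]))
    return [doc_id for _, _, doc_id in scored]

# Hypothetical entity annotations, as an entity linker might produce them.
toy_docs = {
    "d1": ["E:Obama", "E:Obama", "E:Chicago"],
    "d2": ["E:Obama", "E:USA"],
    "d3": ["E:Paris"],
}
ranking = rank(["E:Obama", "E:USA"], toy_docs)
# ranking == ["d2", "d1", "d3"]
```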
DOI: https://doi.org/10.1145/2970398.2970423 (published 2016-09-12)
Citations: 44
Estimating Retrieval Performance Bound for Single Term Queries
Peilin Yang, Hui Fang
Various information retrieval models have been studied for decades. Most traditional retrieval models are based on bag-of-term representations, and they model relevance based on various collection statistics. Despite these efforts, it seems that the performance of "bag-of-term" based retrieval functions has reached a plateau, and it has become increasingly difficult to further improve retrieval performance. Thus, one important research question is whether we can provide any theoretical justification for the empirical performance bound of basic retrieval functions. In this paper, we start with single term queries, and aim to estimate the performance bound of retrieval functions that leverage only basic ranking signals such as document term frequency, inverse document frequency and document length normalization. Specifically, we demonstrate that, when only single-term queries are considered, there is a general function that can cover many basic retrieval functions. We then propose to estimate the upper bound performance of this function by applying a cost/gain analysis to search for the optimal value of the function.
DOI: https://doi.org/10.1145/2970398.2970428 (published 2016-09-12)
Citations: 4
Topic Set Size Design and Power Analysis in Practice
T. Sakai
Topic set size design methods provide principles and procedures for test collection builders to decide on the number of topics to create. These methods can then help us keep improving the test collection design based on accumulated data. Simple Excel tools are available for such purposes. Post-hoc power analysis tools, available as simple R scripts, can help IR researchers examine the achieved power of a reported experiment and determine future sample sizes for ensuring high power. Thus, for example, underpowered user experiments can be detected, and a larger sample size can be proposed. If used appropriately, these Excel and R tools should be able to provide the IR community with better experimentation practices. The main objective of this tutorial is to let IR researchers familiarise themselves with these tools and understand the basic ideas behind them.
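The idea of choosing a sample size to reach a target statistical power can be illustrated with a textbook normal-approximation formula for a two-sided paired comparison. This is a generic sketch, not the Excel or R tools the tutorial refers to; the function name and the numeric inputs (minimum detectable difference, standard deviation of per-topic differences) are assumptions.

```python
import math
from statistics import NormalDist

def required_topics(min_detectable_diff, std_of_diff, alpha=0.05, power=0.80):
    """Approximate number of topics for a two-sided paired test.

    Uses n = ((z_{1-alpha/2} + z_{power}) * sigma / delta)^2, a normal
    approximation; exact t-test based tools give slightly larger values.
    """
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)
    z_beta = z.inv_cdf(power)
    n = ((z_alpha + z_beta) * std_of_diff / min_detectable_diff) ** 2
    return math.ceil(n)

# Hypothetical setting: detect a 0.05 difference in an effectiveness score
# when per-topic score differences have standard deviation 0.15.
topics_needed = required_topics(0.05, 0.15)
# topics_needed == 71
```

Raising the target power or shrinking the difference to detect increases the required number of topics, which is why an underpowered experiment is usually repaired by enlarging the sample.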
DOI: https://doi.org/10.1145/2970398.2970443 (published 2016-09-12)
Citations: 6
Fast Feature Selection for Learning to Rank
Andrea Gigli, C. Lucchese, F. M. Nardini, R. Perego
An emerging research area named Learning-to-Rank (LtR) has shown that effective solutions to the ranking problem can leverage machine learning techniques applied to a large set of features capturing the relevance of a candidate document for the user query. Large-scale search systems must however answer user queries very fast, and the computation of the features for candidate documents must comply with strict back-end latency constraints. The number of features cannot thus grow beyond a given limit, and Feature Selection (FS) techniques have to be exploited to find a subset of features that both meets latency requirements and leads to high effectiveness of the trained models. In this paper, we propose three new algorithms for FS specifically designed for the LtR context where hundreds of continuous or categorical features can be involved. We present a comprehensive experimental analysis conducted on publicly available LtR datasets and we show that the proposed strategies outperform a well-known state-of-the-art competitor.
DOI: https://doi.org/10.1145/2970398.2970433 (published 2016-09-12)
Citations: 18
Temporal Query Expansion Using a Continuous Hidden Markov Model
J. Rao, Jimmy J. Lin
In standard formulations of pseudo-relevance feedback, document timestamps do not play a role in identifying expansion terms. Yet we know that when searching social media posts such as tweets, relevant documents are bursty and usually occur in temporal clusters. The main insight of our work is that term expansions should be biased to draw from documents that occur in bursty temporal clusters. This is formally captured by a continuous hidden Markov model (cHMM), for which we derive an EM algorithm for parameter estimation. Given a query, we estimate the parameters for a cHMM that best explains the observed distribution of an initial set of retrieved documents, and then use Viterbi decoding to compute the most likely state sequence. In identifying expansion terms, we only select documents from bursty states. Experiments on test collections from the TREC 2011 and 2012 Microblog tracks show that our approach is significantly more effective than the popular RM3 pseudo-relevance feedback model.
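The Viterbi decoding step named in this abstract can be sketched for a two-state model with continuous (Gaussian) emissions. Everything concrete here is an assumption for illustration: the observations are taken to be counts of retrieved documents per time bucket, state 0 is "normal" and state 1 is "bursty", and the parameters are set by hand rather than estimated with EM as in the paper.

```python
import math

def log_gaussian(x, mean, std):
    """Log density of a univariate Gaussian."""
    return -0.5 * math.log(2.0 * math.pi * std * std) - (x - mean) ** 2 / (2.0 * std * std)

def viterbi(observations, log_start, log_trans, emission_params):
    """Most likely state sequence for an HMM with Gaussian emissions (log space)."""
    n_states = len(log_start)
    # best[t][s]: log probability of the best path that ends in state s at time t
    best = [[log_start[s] + log_gaussian(observations[0], *emission_params[s])
             for s in range(n_states)]]
    back = []  # back[t-1][s]: best predecessor of state s at time t
    for x in observations[1:]:
        prev = best[-1]
        row, pointers = [], []
        for s in range(n_states):
            p = max(range(n_states), key=lambda r: prev[r] + log_trans[r][s])
            row.append(prev[p] + log_trans[p][s] + log_gaussian(x, *emission_params[s]))
            pointers.append(p)
        best.append(row)
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

# Hand-set, hypothetical parameters: state 0 = normal, state 1 = bursty.
counts = [1, 2, 1, 9, 11, 10, 2, 1]
log_start = [math.log(0.9), math.log(0.1)]
log_trans = [[math.log(0.9), math.log(0.1)],
             [math.log(0.2), math.log(0.8)]]
emission_params = [(1.5, 1.0), (10.0, 2.0)]  # (mean, std) per state

path = viterbi(counts, log_start, log_trans, emission_params)
bursty_buckets = [t for t, s in enumerate(path) if s == 1]
# path == [0, 0, 0, 1, 1, 1, 0, 0]; bursty_buckets == [3, 4, 5]
```

In a pipeline of the kind the abstract describes, only documents falling in the bursty buckets would then be passed on as the source of expansion terms.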
DOI: https://doi.org/10.1145/2970398.2970424 (published 2016-09-12)
Citations: 9
Exploring Urban Lifestyles Using a Nonparametric Temporal Graphical Model
Shoaib Jameel, Yi Liao, Wai Lam, S. Schockaert, Xing Xie
We propose a new unsupervised nonparametric temporal topic model to discover lifestyle patterns from location-based social networks. By relating the textual content, time stamps, and venue categories associated with user check-ins, our framework detects the predominant lifestyle patterns in a given geographic region. The temporal component of our model allows us to analyse the evolution of lifestyle patterns throughout the year. We provide examples of interesting patterns that have been discovered by our model, and we show that our model compares favourably to existing approaches in terms of lifestyle pattern quality and computation time. We also quantitatively show that our model outperforms existing methods in a time stamp prediction task.
DOI: https://doi.org/10.1145/2970398.2970401 (published 2016-09-12)
Citations: 1
Collaborative Information Retrieval: Frameworks, Theoretical Models, and Emerging Topics
L. Tamine, L. Soulier
A great amount of research in the IR domain has dealt mostly with the design of enhanced document ranking models that improve search through user-to-system collaboration. However, in addition to the user-to-system form of collaboration, user-to-user collaboration is increasingly acknowledged as an effective means of gathering the complementary skills and/or knowledge of individual users in order to solve complex search tasks. This tutorial will first give an overview of the ways in which collaboration has been implemented in IR models in an attempt to improve search outcomes with respect to several tasks and related frameworks (ad-hoc search, group-based recommendation, social search, collaborative search). Second, as envisioned in the collaborative IR (CIR) domain, we will focus on the theoretical models that support and drive user-to-user collaboration in order to perform shared IR tasks. Third, we will develop a road map on emerging and relevant topics addressing issues related to collaboration design. Our goal is to provide participants with concepts and motivation allowing them to investigate this emerging IR domain, as well as to give them some clues on how to tackle issues related to the optimization of collaborative tasks. More specifically, the tutorial aims to: (a) give an overview of the key concept of collaboration in IR and related research topics; (b) present state-of-the-art CIR techniques and models; (c) discuss the emerging topics that deal with collaboration; (d) point out some challenges ahead.
DOI: https://doi.org/10.1145/2970398.2970442 (published 2016-09-12)
Citations: 4