Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval: Latest Articles

Cross-Language Microblog Retrieval using Latent Semantic Modeling
Archana Godavarthy, Yi Fang
Microblogging has become one of the major tools for sharing real-time information among people around the world. Finding relevant information across different languages on microblogs is highly desirable, especially for the large number of multilingual users. However, the characteristics of microblog content pose great challenges to existing cross-language information retrieval approaches. In this paper, we address the task of retrieving relevant tweets given another tweet in a different language. We build parallel corpora for tweets in different languages by bridging them via shared hashtags. We propose a latent semantic approach that models the parallel corpora by mapping the parallel tweets to a low-dimensional shared semantic space. The relevance between tweets in different languages is measured in this shared latent space, and the model is trained with a pairwise loss function. Preliminary experiments on a Twitter dataset demonstrate the effectiveness of the proposed approach.
DOI: https://doi.org/10.1145/2970398.2970436 (published 2016-09-12)
Citations: 4
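The abstract of "Cross-Language Microblog Retrieval using Latent Semantic Modeling" describes scoring tweets in a shared latent space and training with a pairwise loss. A minimal sketch of that idea follows. The linear projections `W_src` and `W_tgt`, the cosine relevance score, and the hinge form of the loss are all assumptions made for illustration; the abstract does not state which projection or loss the authors use.

```python
import math


def project(matrix, vec):
    """Map a term-weight vector into the latent space (matrix times vector)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def cosine(u, v):
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pairwise_hinge_loss(W_src, W_tgt, query, positive, negative, margin=0.5):
    """Hypothetical pairwise loss: the parallel (positive) tweet should score
    at least `margin` higher than a non-parallel (negative) tweet."""
    q = project(W_src, query)
    pos = project(W_tgt, positive)
    neg = project(W_tgt, negative)
    return max(0.0, margin - cosine(q, pos) + cosine(q, neg))
```

The loss is zero once the positive tweet outscores the negative one by the margin, so training only adjusts the projections for pairs that are still ranked incorrectly or too closely.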
Learning to Rank User Queries to Detect Search Tasks
C. Lucchese, F. M. Nardini, S. Orlando, Gabriele Tolomei
We present a framework for discovering sets of web queries having similar latent needs, called search tasks, from user queries stored in a search engine log. The framework consists of two main modules: Query Similarity Learning (QSL) and Graph-based Query Clustering (GQC). The former is devoted to learning a query similarity function from a ground truth of manually labeled search tasks. The latter represents each user search log as a graph whose nodes are queries, and uses the learned similarity function to weight edges between query pairs. Finally, search tasks are detected by clustering those queries in the graph that are connected by the strongest links, that is, by detecting the strongest connected components of the graph. To discriminate between "strong" and "weak" links, the GQC module also entails a learning phase whose goal is to estimate the best threshold for pruning the edges of the graph. We discuss how the QSL module can be effectively implemented using Learning to Rank (L2R) techniques. Experiments on a real-world search engine log show that query similarity functions learned using L2R lead to better-performing GQC implementations when compared to similarity functions induced by other state-of-the-art machine learning solutions, such as logistic regression and decision trees.
DOI: https://doi.org/10.1145/2970398.2970407 (published 2016-09-12)
Citations: 2
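The clustering step described in "Learning to Rank User Queries to Detect Search Tasks" (weight query pairs by similarity, prune weak edges, take connected components) can be sketched as below. The `jaccard` function is only a stand-in for the learned similarity function, and the threshold value is hypothetical; in the paper both are learned from labeled data.

```python
def jaccard(a, b):
    """Stand-in similarity: token overlap between two query strings."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def detect_tasks(queries, similarity, threshold):
    """Group queries into tasks: keep edges with weight >= threshold,
    then return the connected components (via union-find)."""
    parent = list(range(len(queries)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(queries)):
        for j in range(i + 1, len(queries)):
            if similarity(queries[i], queries[j]) >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i, query in enumerate(queries):
        groups.setdefault(find(i), []).append(query)
    return sorted(groups.values())
```

Raising the threshold prunes more edges and splits the log into more, smaller tasks; lowering it merges tasks, which is why the abstract treats the threshold as something to estimate rather than fix by hand.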
Total Recall: Blue Sky on Mars
C. Clarke, G. Cormack, Jimmy J. Lin, Adam Roegiest
There are presently plans to create permanent colonies on Mars so that humanity will have a second home. These colonists will need search, email, entertainment, and indeed most services provided on the modern web. The primary challenge is network latencies, since the two planets are anywhere from 4 to 24 light minutes apart. A recent article sketches out how we might develop search technologies for Mars based on physically transporting a cache of the web to Mars, to which updates are applied via predictive models. Within this general framework, we explore the problem of high-recall retrieval, such as conducting a scientific survey. We explore simple techniques for masking speed-of-light delays and find that "priming" the search process with a small Martian cache is sufficient to mask a moderate amount of network latency. Simulation experiments show that it is possible to engineer high-recall search from Mars to be quite similar to the experience on Earth.
DOI: https://doi.org/10.1145/2970398.2970430 (published 2016-09-12)
Citations: 1
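To give a feel for the latencies mentioned in "Total Recall: Blue Sky on Mars", the arithmetic below converts the one-way delay of 4 to 24 light minutes into a round-trip time, and estimates how many cached documents a user would have to review locally to stay busy for one round trip. The reviewing rate `seconds_per_doc` is a hypothetical parameter, and this back-of-the-envelope model is not the simulation used in the paper.

```python
import math


def round_trip_minutes(one_way_light_minutes):
    """A request to Earth and its response each take one one-way delay."""
    return 2 * one_way_light_minutes


def cache_docs_to_mask(one_way_light_minutes, seconds_per_doc):
    """Documents a local cache must supply so that reviewing them takes at
    least one round trip (assumes a constant, hypothetical reviewing rate)."""
    round_trip_seconds = round_trip_minutes(one_way_light_minutes) * 60
    return math.ceil(round_trip_seconds / seconds_per_doc)
```

At the closest approach the round trip is 8 minutes and at the farthest it is 48 minutes, so under this simple model the cache size needed to hide the delay grows linearly with the distance between the planets.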
A Unified Energy-based Framework for Learning to Rank
Yi Fang, Mengwen Liu
Learning to Rank (L2R) has emerged as one of the core machine learning techniques for IR. On the other hand, Energy-Based Models (EBMs) capture dependencies between variables by associating a scalar energy to each configuration of the variables. They have produced impressive results in many computer vision and speech recognition tasks. In this paper, we introduce a unified view of Learning to Rank that integrates various L2R approaches in an energy-based ranking framework. In this framework, an energy function associates low energies to desired documents and high energies to undesired results. Learning is essentially the process of shaping the energy surface so that desired documents have lower energies. The proposed framework yields new insights into learning to rank. First, we show how various existing L2R models (pointwise, pairwise, and listwise) can be cast in the energy-based framework. Second, new L2R models can be constructed based on existing EBMs. Furthermore, inspired by the intuitive learning process of EBMs, we can devise novel energy-based models for ranking tasks. We introduce several new energy-based ranking models based on the proposed framework. The experiments are conducted on the public LETOR 4.0 benchmarks and demonstrate the effectiveness of the proposed models.
DOI: https://doi.org/10.1145/2970398.2970416 (published 2016-09-12)
Citations: 2
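The energy-based view in "A Unified Energy-based Framework for Learning to Rank" assigns low energy to desired documents and high energy to undesired ones. A minimal sketch, assuming a linear energy function and a margin-based (hinge) loss, is shown below; both choices are illustrative assumptions and are not claimed to be the specific models proposed in the paper.

```python
def energy(weights, features):
    """Hypothetical linear energy: a higher weighted feature sum means lower energy."""
    return -sum(w * x for w, x in zip(weights, features))


def hinge_energy_loss(weights, desired, undesired, margin=1.0):
    """Pairwise loss that is zero once the desired document's energy is at
    least `margin` below the undesired document's energy."""
    return max(0.0, margin + energy(weights, desired) - energy(weights, undesired))


def rank_by_energy(weights, docs):
    """Rank document ids by ascending energy (lowest energy first)."""
    return sorted(docs, key=lambda doc_id: energy(weights, docs[doc_id]))
```

Minimizing this loss over many document pairs pushes down the energy of desired documents relative to undesired ones, which is the "shaping of the energy surface" that the abstract refers to.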
End to End Long Short Term Memory Networks for Non-Factoid Question Answering
Daniel Cohen, W. Bruce Croft
Retrieving correct answers for non-factoid queries poses significant challenges for current answer retrieval methods. Existing methods either involve the laborious task of extracting numerous features or are ineffective for longer answers. We approach the task of non-factoid question answering using deep learning methods without the need for feature extraction. Neural networks are capable of learning complex relations based on relatively simple features, which makes them a prime candidate for relating non-factoid questions to their answers. In this paper, we show that end-to-end training with a Bidirectional Long Short Term Memory (BLSTM) network with a rank-sensitive loss function results in significant performance improvements over previous approaches without the need for combining additional models.
DOI: https://doi.org/10.1145/2970398.2970438 (published 2016-09-12)
Citations: 49
Classifying User Search Intents for Query Auto-Completion
Jyun-Yu Jiang, Pu-Jen Cheng
The function of query auto-completion in modern search engines is to help users formulate queries quickly and precisely. Conventional context-aware methods primarily rank candidate queries according to their term and query relationships to the context. However, most sessions are extremely short. Capturing search intents through such relationships becomes difficult when the context contains only a few queries. In this paper, we investigate the feasibility of discovering search intents within short contexts for query auto-completion. The class distribution of the search session (i.e., issued queries and click behavior) is derived to represent search intents. Several distribution-based features are proposed to estimate the proximity between candidates and search intents. Finally, we apply learning-to-rank to predict the user's intended query according to these features. Moreover, we also design an ensemble model to combine the benefits of our proposed features and conventional term-based approaches. Extensive experiments have been conducted on the publicly available AOL search engine log. The experimental results demonstrate that our approach significantly outperforms six competitive baselines. Performance in terms of keystrokes is also evaluated in the experiments. Furthermore, we provide an in-depth analysis to justify the usability of search intent classification for query auto-completion.
DOI: https://doi.org/10.1145/2970398.2970400 (published 2016-09-12)
Citations: 14
Retrievability in API-Based "Evaluation as a Service"
Jiaul H. Paik, Jimmy J. Lin
"Evaluation as a service" (EaaS) refers to a family of related evaluation methodologies that enable community-wide evaluations and the construction of test collections on documents that cannot be easily distributed. In the API-based approach, the basic idea is that evaluation organizers provide a service API through which the evaluation task can be completed, without providing access to the raw collection. One concern with this evaluation approach is that the API introduces biases and limits the diversity of techniques that can be brought to bear on the problem. In this paper, we tackle the question of API bias using the concept of retrievability. The raw data for our analyses come from a naturally occurring experiment where we observed the same groups completing the same task with the API and also with access to the raw collection. We find that the retrievability bias of runs generated in both cases is comparable. Moreover, the fraction of relevant tweets retrieved through the API by the participating groups is at least as high as when they had access to the raw collection.
DOI: https://doi.org/10.1145/2970398.2970427 (published 2016-09-12)
Citations: 8
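The abstract of "Retrievability in API-Based 'Evaluation as a Service'" compares retrievability bias across two settings without defining the measure. One common formulation, assumed here for illustration, counts how often each document appears within a rank cutoff over a set of queries and summarizes the inequality of those counts with a Gini coefficient. Whether the paper uses exactly this cutoff-based score and this summary statistic is an assumption.

```python
def retrievability(rankings, collection, cutoff):
    """Count, for each document in the collection, the number of queries
    whose top-`cutoff` results contain it (cumulative retrievability)."""
    counts = {doc: 0 for doc in collection}
    for ranked_docs in rankings.values():
        for doc in ranked_docs[:cutoff]:
            if doc in counts:
                counts[doc] += 1
    return counts


def gini(values):
    """Gini coefficient of non-negative values: 0.0 means every document is
    equally retrievable; values near 1.0 mean a few documents dominate."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)
```

Computing `gini(retrievability(...).values())` once for API-based runs and once for raw-collection runs would give two numbers that can be compared directly, which is the kind of comparison the abstract reports.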
Query Anchoring Using Discriminative Query Models
Saar Kuzi, Anna Shtok, Oren Kurland
Pseudo-feedback-based query models are induced from a result list of the documents most highly ranked by the initial search performed for the query. Since the result list often contains much non-relevant information, query models are anchored to the query using various techniques. We present a novel unsupervised discriminative query model that can be used, by several methods proposed herein, for query anchoring of existing query models. The model is induced from the result list using a learning-to-rank approach, and constitutes a discriminative term-based representation of the initial ranking. We show that applying our methods to generative query models can improve retrieval performance.
DOI: https://doi.org/10.1145/2970398.2970402 (published 2016-09-12)
Citations: 1
PDF
Ashraf Bah Rabiou, Ben Carterette
Data fusion has been shown to be a simple and effective way to improve retrieval results. Most existing data fusion methods combine ranked lists from different retrieval functions for a single given query. But in many real search settings, the diversity of retrieval functions required to achieve good fusion performance is not available. Researchers are typically limited to a few variants on a scoring function used by the engine of their choice, with these variants often producing similar results due to being based on the same underlying term statistics. This paper presents a framework for data fusion based on combining ranked lists from different queries that users could have entered for their information need. If we can identify a set of "possible queries" for an information need, and estimate probability distributions concerning the probability of generating those queries, the probability of retrieving certain documents for those queries, and the probability of documents being relevant to that information need, we have the potential to dramatically improve results over a baseline system given a single user query. Our framework is based on several component models that can be mixed and matched. We present several simple estimation methods for components. In order to demonstrate effectiveness, we present experimental results on 5 different datasets covering tasks such as ad-hoc search, novelty and diversity search, and search in the presence of implicit user feedback. Our results show strong performances for our method; it is competitive with state-of-the-art methods on the same datasets, and in some cases outperforms them.
DOI: https://doi.org/10.1145/2970398.2970419 (published 2016-09-12)
Citations: 4
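The data fusion framework by Bah Rabiou and Carterette combines ranked lists from several "possible queries" for one information need. A minimal sketch of the core idea is to weight each query's retrieval probabilities by the probability of that query being issued and sum over queries. The dictionaries `query_probs` and `retrieval_probs` are hypothetical inputs, and this sketch omits the relevance component that the abstract also mentions; it is not the paper's full set of component models.

```python
def fuse(query_probs, retrieval_probs):
    """Score each document as the sum over queries of
    P(query | need) * P(document | query); return (doc, score) pairs
    sorted by descending score, with ties broken by document id."""
    scores = {}
    for query, p_query in query_probs.items():
        for doc, p_doc in retrieval_probs.get(query, {}).items():
            scores[doc] = scores.get(doc, 0.0) + p_query * p_doc
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

A document that is retrieved with moderate probability by several plausible queries can outrank one that is retrieved strongly by a single query, which is how fusion over queries can improve on a single-query baseline.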
Efficient and Effective Higher Order Proximity Modeling
Xiaolu Lu, Alistair Moffat, J. Culpepper
Bag-of-words retrieval models are widely used, and provide a robust trade-off between efficiency and effectiveness. These models often make simplifying assumptions about relations between query terms, and treat term statistics independently. However, query terms are rarely independent, and previous work has repeatedly shown that term dependencies can be critical to improving the effectiveness of ranked retrieval results. Among all term-dependency models, the Markov Random Field (MRF) [Metzler and Croft, SIGIR, 2005] model has received the most attention in recent years. Despite clear effectiveness improvements, these models are not deployed in performance-critical applications because of the potentially high computational costs. As a result, bigram models are generally considered to be the best compromise between full term dependence, and term-independent models such as BM25. Here we provide further evidence that term-dependency features not captured by bag-of-words models can reliably improve retrieval effectiveness. We also present a new variation on the highly-effective MRF model that relies on a BM25-derived potential. The benefit of this approach is that it is built from feature functions which require no higher-order global statistics. We empirically show that our new model reduces retrieval costs by up to 60%, with no loss in effectiveness compared to previous approaches.
Citations: 9
Journal
Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval