Title: Cross-Language Microblog Retrieval using Latent Semantic Modeling
Authors: Archana Godavarthy, Yi Fang
DOI: https://doi.org/10.1145/2970398.2970436
In: Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval (ICTIR '16), September 12, 2016.

Abstract: Microblogging has become one of the major tools for sharing real-time information around the world. Finding relevant information across languages on microblogs is highly desirable, especially for the large number of multilingual users, but the characteristics of microblog content pose great challenges to existing cross-language information retrieval approaches. In this paper, we address the task of retrieving relevant tweets given a tweet in a different language. We build parallel corpora of tweets in different languages by bridging them via shared hashtags, and propose a latent semantic approach that models the parallel corpora by mapping parallel tweets into a low-dimensional shared semantic space. Relevance between tweets in different languages is measured in this shared latent space, and the model is trained with a pairwise loss function. Preliminary experiments on a Twitter dataset demonstrate the effectiveness of the proposed approach.
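The shared-space idea can be sketched in a few lines of numpy: project each language's term vector through its own linear map, score by cosine similarity in the shared space, and penalize with a hinge-style pairwise loss. This is a minimal illustration under assumed linear projections; the paper's actual parameterization and training procedure may differ, and all names here are hypothetical.

```python
import numpy as np

def shared_space_score(x_src, x_tgt, W_src, W_tgt):
    """Map each tweet's term vector into the shared latent space via its
    language-specific projection, then score by cosine similarity there."""
    u, v = W_src @ x_src, W_tgt @ x_tgt
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def pairwise_loss(score_pos, score_neg, margin=1.0):
    """Hinge-style pairwise loss: the parallel (hashtag-linked) pair should
    outscore a random non-parallel pair by at least `margin`."""
    return max(0.0, margin - score_pos + score_neg)
```

Training would repeatedly sample a parallel pair and a non-parallel pair and adjust `W_src`/`W_tgt` to drive this loss toward zero.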
Title: Learning to Rank User Queries to Detect Search Tasks
Authors: C. Lucchese, F. M. Nardini, S. Orlando, Gabriele Tolomei
DOI: https://doi.org/10.1145/2970398.2970407
In: Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval (ICTIR '16), September 12, 2016.

Abstract: We present a framework for discovering sets of web queries with similar latent needs, called search tasks, from user queries stored in a search engine log. The framework consists of two main modules: Query Similarity Learning (QSL) and Graph-based Query Clustering (GQC). The former learns a query similarity function from a ground truth of manually labeled search tasks. The latter represents each user's search log as a graph whose nodes are queries, and uses the learned similarity function to weight the edges between query pairs. Search tasks are then detected by clustering the queries joined by the strongest links, i.e., by extracting the connected components that remain once weak edges are removed. To discriminate between "strong" and "weak" links, the GQC module also entails a learning phase whose goal is to estimate the best threshold for pruning the edges of the graph. We discuss how the QSL module can be effectively implemented using Learning to Rank (L2R) techniques. Experiments on a real-world search engine log show that query similarity functions learned with L2R lead to better-performing GQC implementations than similarity functions induced by other state-of-the-art machine learning solutions, such as logistic regression and decision trees.
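The GQC step — weight edges with the learned similarity, prune below a threshold, and take connected components — can be sketched as follows. This is a minimal illustration, assuming the similarity function is supplied as a black box (standing in for the learned QSL model); the authors' actual implementation may differ.

```python
def detect_search_tasks(queries, similarity, threshold):
    """Cluster one user's queries into search tasks: connect query pairs
    whose similarity meets the pruning threshold, then return the
    connected components of the resulting graph."""
    # Build adjacency over above-threshold edges only (pruning weak links).
    adj = {q: set() for q in queries}
    for i, qi in enumerate(queries):
        for qj in queries[i + 1:]:
            if similarity(qi, qj) >= threshold:
                adj[qi].add(qj)
                adj[qj].add(qi)
    # Extract connected components via iterative DFS.
    tasks, seen = [], set()
    for q in queries:
        if q in seen:
            continue
        stack, comp = [q], set()
        while stack:
            cur = stack.pop()
            if cur in comp:
                continue
            comp.add(cur)
            stack.extend(adj[cur] - comp)
        seen |= comp
        tasks.append(comp)
    return tasks
```

In the paper the threshold itself is learned; here it is simply a parameter.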
Title: Total Recall: Blue Sky on Mars
Authors: C. Clarke, G. Cormack, Jimmy J. Lin, Adam Roegiest
DOI: https://doi.org/10.1145/2970398.2970430
In: Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval (ICTIR '16), September 12, 2016.

Abstract: There are plans to establish permanent colonies on Mars so that humanity will have a second home. These colonists will need search, email, entertainment, and indeed most services provided on the modern web. The primary challenge is network latency, since the two planets are anywhere from 4 to 24 light minutes apart. A recent article sketches how search technologies for Mars might be built by physically transporting a cache of the web to Mars, to which updates are applied via predictive models. Within this general framework, we explore the problem of high-recall retrieval, such as conducting a scientific survey. We examine simple techniques for masking speed-of-light delays and find that "priming" the search process with a small Martian cache is sufficient to mask a moderate amount of network latency. Simulation experiments show that it is possible to engineer high-recall search from Mars that is quite similar to the experience on Earth.
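The latency-masking trade-off can be illustrated with a toy model: queries answered from the Martian cache return almost instantly, while misses pay a full Earth round trip at light speed. This is only an illustrative sketch of the setting, not the paper's simulation, and the numbers are hypothetical.

```python
def mean_response_minutes(queries, cache, one_way_delay_min):
    """Toy model of Mars-side search: cache hits are answered locally
    (treated as zero delay); misses pay a full round trip to Earth.
    `one_way_delay_min` is the one-way light delay, between 4 and 24 min."""
    round_trip = 2 * one_way_delay_min
    delays = [0.0 if q in cache else round_trip for q in queries]
    return sum(delays) / len(delays)
```

Priming the cache raises the hit rate and so drives the mean perceived delay toward zero, which is the effect the paper's experiments quantify for high-recall tasks.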
Title: A Unified Energy-based Framework for Learning to Rank
Authors: Yi Fang, Mengwen Liu
DOI: https://doi.org/10.1145/2970398.2970416
In: Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval (ICTIR '16), September 12, 2016.

Abstract: Learning to Rank (L2R) has emerged as one of the core machine learning techniques for IR. Energy-Based Models (EBMs), meanwhile, capture dependencies between variables by associating a scalar energy with each configuration of the variables, and have produced impressive results in many computer vision and speech recognition tasks. In this paper, we introduce a unified view of learning to rank that integrates various L2R approaches in an energy-based ranking framework. In this framework, an energy function assigns low energies to desired documents and high energies to undesired results; learning is essentially the process of shaping the energy surface so that desired documents have lower energies. The proposed framework yields new insights into learning to rank. First, we show how various existing L2R models (pointwise, pairwise, and listwise) can be cast in the energy-based framework. Second, new L2R models can be constructed from existing EBMs. Furthermore, inspired by the intuitive learning process of EBMs, we can devise novel energy-based models for ranking tasks, and we introduce several such models based on the proposed framework. Experiments conducted on the public LETOR 4.0 benchmarks demonstrate the effectiveness of the proposed models.
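One concrete instance of "pairwise L2R as energy shaping" is a linear energy with a margin loss that pushes a relevant document's energy below a non-relevant one's. This is a minimal sketch of the general idea, not one of the paper's specific models; the linear form and margin are assumptions.

```python
import numpy as np

def energy(w, features):
    """Linear energy over query-document features: lower = more desirable."""
    return float(np.dot(w, features))

def pairwise_energy_loss(w, feats_rel, feats_non, margin=1.0):
    """Pairwise ranking cast as energy shaping: penalize unless the
    relevant document's energy is below the non-relevant document's
    energy by at least `margin`."""
    return max(0.0, margin + energy(w, feats_rel) - energy(w, feats_non))
```

Minimizing this loss over training pairs "pushes down" energy at relevant documents and "pushes up" energy at non-relevant ones, which is exactly the surface-shaping view of learning the abstract describes.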
Title: End to End Long Short Term Memory Networks for Non-Factoid Question Answering
Authors: Daniel Cohen, W. Bruce Croft
DOI: https://doi.org/10.1145/2970398.2970438
In: Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval (ICTIR '16), September 12, 2016.

Abstract: Retrieving correct answers for non-factoid queries poses significant challenges for current answer retrieval methods, which either involve the laborious task of extracting numerous features or are ineffective for longer answers. We approach non-factoid question answering with deep learning methods that require no feature extraction. Neural networks are capable of learning complex relations from relatively simple features, which makes them a prime candidate for relating non-factoid questions to their answers. In this paper, we show that end-to-end training of a Bidirectional Long Short-Term Memory (BLSTM) network with a rank-sensitive loss function yields significant performance improvements over previous approaches, without the need to combine additional models.
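The rank-sensitive part of the training objective can be sketched independently of the network: given the model's score for the correct answer and for incorrect candidates, a margin loss penalizes every incorrect candidate that scores too close to the correct one. The BLSTM itself would produce these scores; this sketch, with an assumed hinge form and margin, shows only the loss shape.

```python
def rank_sensitive_loss(score_correct, scores_incorrect, margin=0.5):
    """Margin-based loss over QA candidates: the correct answer's score
    should exceed every incorrect candidate's score by at least `margin`;
    each violation contributes its shortfall to the loss."""
    return sum(max(0.0, margin - score_correct + s) for s in scores_incorrect)
```

A loss of zero means the correct answer is already ranked a full margin above all distractors, so such pairs contribute no gradient during training.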
Title: Classifying User Search Intents for Query Auto-Completion
Authors: Jyun-Yu Jiang, Pu-Jen Cheng
DOI: https://doi.org/10.1145/2970398.2970400
In: Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval (ICTIR '16), September 12, 2016.

Abstract: The function of query auto-completion in modern search engines is to help users formulate queries quickly and precisely. Conventional context-aware methods primarily rank candidate queries by their term- and query-level relationships to the context. However, most sessions are extremely short, and capturing search intents with such relationships becomes difficult when the context contains only a few queries. In this paper, we investigate the feasibility of discovering search intents from short contexts for query auto-completion. The class distribution of the search session (i.e., issued queries and click behavior) is derived as the search intent, and several distribution-based features are proposed to estimate the proximity between candidates and that intent. Finally, we apply learning-to-rank to predict the user's intended query from these features. We also design an ensemble model that combines the benefits of our proposed features and conventional term-based approaches. Extensive experiments on the publicly available AOL search engine log show that our approach significantly outperforms six competitive baselines; keystroke performance is evaluated as well. Furthermore, an in-depth analysis justifies the usability of search intent classification for query auto-completion.
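The distribution-based features rest on two simple pieces: aggregating the session's per-query classes into an intent distribution, and measuring a candidate completion's proximity to it. The following sketch assumes a cosine proximity and uniform aggregation; the paper's exact feature definitions may differ.

```python
import math
from collections import Counter

def session_intent_distribution(query_classes):
    """Aggregate the class labels of the session's queries/clicks into a
    probability distribution representing the search intent.
    `query_classes` is a list of per-query class-label lists."""
    counts = Counter(c for labels in query_classes for c in labels)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

def intent_proximity(candidate_dist, intent_dist):
    """Cosine similarity between a candidate completion's class
    distribution and the session's intent distribution."""
    classes = set(candidate_dist) | set(intent_dist)
    dot = sum(candidate_dist.get(c, 0.0) * intent_dist.get(c, 0.0)
              for c in classes)
    na = math.sqrt(sum(v * v for v in candidate_dist.values()))
    nb = math.sqrt(sum(v * v for v in intent_dist.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Such proximities would then be fed, alongside term-based signals, into the learning-to-rank stage.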
Title: Retrievability in API-Based "Evaluation as a Service"
Authors: Jiaul H. Paik, Jimmy J. Lin
DOI: https://doi.org/10.1145/2970398.2970427
In: Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval (ICTIR '16), September 12, 2016.

Abstract: "Evaluation as a service" (EaaS) refers to a family of related evaluation methodologies that enable community-wide evaluations and the construction of test collections over documents that cannot be easily distributed. In the API-based approach, the basic idea is that evaluation organizers provide a service API through which the evaluation task can be completed, without providing access to the raw collection. One concern with this approach is that the API introduces biases and limits the diversity of techniques that can be brought to bear on the problem. In this paper, we tackle the question of API bias using the concept of retrievability. The raw data for our analyses come from a naturally occurring experiment in which we observed the same groups completing the same task both with the API and with access to the raw collection. We find that the retrievability bias of the runs generated in the two settings is comparable. Moreover, the fraction of relevant tweets retrieved through the API by the participating groups is at least as high as when they had access to the raw collection.
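Retrievability is typically computed from a large set of query runs: a document's score reflects how often it surfaces within a rank cutoff across queries. A minimal cumulative-style sketch (assuming a simple 0/1 rank-cutoff weighting, which is one common choice rather than the paper's exact setup):

```python
def retrievability(run, cutoff=10):
    """Cumulative-style retrievability: r(d) counts, over all queries, how
    many times document d appears in the top-`cutoff` of the ranked list.
    `run` maps each query id to its ranked list of document ids."""
    r = {}
    for ranking in run.values():
        for doc in ranking[:cutoff]:
            r[doc] = r.get(doc, 0) + 1
    return r
```

Comparing the distributions of r(d) for API-based runs versus raw-collection runs (e.g., via a Gini-style inequality measure) is how one would quantify the bias the paper investigates.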
Title: Query Anchoring Using Discriminative Query Models
Authors: Saar Kuzi, Anna Shtok, Oren Kurland
DOI: https://doi.org/10.1145/2970398.2970402
In: Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval (ICTIR '16), September 12, 2016.

Abstract: Pseudo-feedback-based query models are induced from a result list of the documents most highly ranked by an initial search performed for the query. Since the result list often contains much non-relevant information, query models are anchored to the query using various techniques. We present a novel unsupervised discriminative query model that can be used, via several methods proposed herein, for query anchoring of existing query models. The model is induced from the result list using a learning-to-rank approach and constitutes a discriminative, term-based representation of the initial ranking. We show that applying our methods to generative query models can improve retrieval performance.
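A standard way to anchor an induced term distribution to the original query is linear interpolation between the two models. The sketch below shows that generic anchoring step only (the interpolation form and weight are assumptions; the paper proposes several anchoring methods beyond this):

```python
def anchor_query_model(original, induced, anchor_weight=0.6):
    """Anchor an induced term distribution (e.g., from pseudo feedback or
    a discriminative model of the result list) to the original query
    model by linear interpolation. Both arguments map term -> probability."""
    vocab = set(original) | set(induced)
    return {w: anchor_weight * original.get(w, 0.0)
               + (1 - anchor_weight) * induced.get(w, 0.0)
            for w in vocab}
```

If both inputs are proper distributions, the interpolated model sums to one as well, so it can be plugged directly back into a language-model retrieval scoring function.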
Title: PDF
Authors: Ashraf Bah Rabiou, Ben Carterette
DOI: https://doi.org/10.1145/2970398.2970419
In: Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval (ICTIR '16), September 12, 2016.

Abstract: Data fusion has been shown to be a simple and effective way to improve retrieval results. Most existing data fusion methods combine ranked lists from different retrieval functions for a single given query, but in many real search settings the diversity of retrieval functions required for good fusion performance is not available. Researchers are typically limited to a few variants of the scoring function used by the engine of their choice, and these variants often produce similar results because they are based on the same underlying term statistics. This paper presents a framework for data fusion based on combining ranked lists from different queries that users could have entered for their information need. If we can identify a set of "possible queries" for an information need, and estimate probability distributions for generating those queries, for retrieving particular documents given those queries, and for documents being relevant to the information need, we have the potential to dramatically improve results over a baseline system given a single user query. Our framework is built from several component models that can be mixed and matched, and we present simple estimation methods for the components. To demonstrate effectiveness, we report experimental results on five datasets covering tasks such as ad hoc search, novelty and diversity search, and search in the presence of implicit user feedback. Our method performs strongly: it is competitive with state-of-the-art methods on the same datasets and in some cases outperforms them.
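The core fusion computation mixes the component models: a document's fused score is the sum, over possible queries, of the query's probability given the information need times the document's (normalized) retrieval score for that query. A minimal sketch of that mixture, with the component estimates supplied as plain dictionaries (the paper's estimation methods are not reproduced here):

```python
def fuse_over_queries(query_probs, retrieval_scores):
    """Fuse ranked lists from multiple 'possible queries': each document's
    fused score is sum over q of P(q | information need) * score(d | q).
    `query_probs` maps query -> probability; `retrieval_scores` maps
    query -> {doc id -> normalized retrieval score}."""
    fused = {}
    for q, p_q in query_probs.items():
        for doc, s in retrieval_scores.get(q, {}).items():
            fused[doc] = fused.get(doc, 0.0) + p_q * s
    return fused
```

With uniform query probabilities this reduces to a CombSUM-style fusion; non-uniform estimates let likely query formulations dominate the final ranking.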
Title: Efficient and Effective Higher Order Proximity Modeling
Authors: Xiaolu Lu, Alistair Moffat, J. Culpepper
DOI: https://doi.org/10.1145/2970398.2970404
In: Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval (ICTIR '16), September 12, 2016.

Abstract: Bag-of-words retrieval models are widely used and provide a robust trade-off between efficiency and effectiveness. These models often make simplifying assumptions about relations between query terms and treat term statistics independently. However, query terms are rarely independent, and previous work has repeatedly shown that term dependencies can be critical to improving the effectiveness of ranked retrieval. Among term-dependency models, the Markov Random Field (MRF) model [Metzler and Croft, SIGIR 2005] has received the most attention in recent years. Despite clear effectiveness improvements, such models are not deployed in performance-critical applications because of their potentially high computational costs; as a result, bigram models are generally considered the best compromise between full term dependence and term-independent models such as BM25. Here we provide further evidence that term-dependency features not captured by bag-of-words models can reliably improve retrieval effectiveness. We also present a new variation on the highly effective MRF model that relies on a BM25-derived potential. The benefit of this approach is that it is built from feature functions that require no higher-order global statistics. We empirically show that our new model reduces retrieval costs by up to 60%, with no loss in effectiveness compared to previous approaches.
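The flavor of "BM25-derived potentials without higher-order global statistics" can be sketched as a unigram BM25 score plus a BM25-style bonus for adjacent query-term pairs co-occurring within a small window. This is an illustrative simplification with assumed parameters (window, interpolation weight, and a default pair document frequency), not the paper's exact model.

```python
import math

def bm25_weight(tf, df, n_docs, doc_len, avg_len, k1=1.2, b=0.75):
    """Standard BM25 term weight for one term in one document."""
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))

def proximity_score(query_terms, doc_terms, n_docs, df, avg_len,
                    window=2, lam=0.4):
    """Unigram BM25 plus a BM25-style potential over adjacent query-term
    pairs that co-occur within `window` positions. The pair potential
    assumes df=1 (no global pair statistics are collected)."""
    dl = len(doc_terms)
    score = sum(bm25_weight(doc_terms.count(t), df.get(t, 1),
                            n_docs, dl, avg_len)
                for t in query_terms)
    for t1, t2 in zip(query_terms, query_terms[1:]):
        pair_tf = sum(1 for i, t in enumerate(doc_terms[:-1])
                      if t == t1 and t2 in doc_terms[i + 1:i + 1 + window])
        if pair_tf:
            score += lam * bm25_weight(pair_tf, 1, n_docs, dl, avg_len)
    return score
```

Because the pair potential reuses only per-document counts, no bigram index or global pair statistics are needed, which is the efficiency argument the abstract makes.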