
Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval: Latest Publications

Exploiting Entity Linking in Queries for Entity Retrieval
Faegheh Hasibi, K. Balog, Svein Erik Bratsberg
The premise of entity retrieval is to better answer search queries by returning specific entities instead of documents. Many queries mention particular entities; recognizing and linking them to the corresponding entry in a knowledge base is known as the task of entity linking in queries. In this paper we make a first attempt at bringing together these two, i.e., leveraging entity annotations of queries in the entity retrieval model. We introduce a new probabilistic component and show how it can be applied on top of any term-based entity retrieval model that can be emulated in the Markov Random Field framework, including language models, sequential dependence models, as well as their fielded variations. Using a standard entity retrieval test collection, we show that our extension brings consistent improvements over all baseline methods, including the current state-of-the-art. We further show that our extension is robust against parameter settings.
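A minimal sketch of the interpolation idea this abstract describes, not the paper's exact MRF formulation: a term-based entity score (here a Dirichlet-smoothed language model) is mixed with a component derived from the query's entity annotations. The function names, the `lam` mixing weight, and the confidence-weighted annotation score are illustrative assumptions.

```python
import math
from collections import Counter

def term_score(query_terms, entity_terms, collection_lm, mu=100.0):
    # Baseline term-based component: Dirichlet-smoothed query likelihood
    # over the entity's textual representation.
    tf = Counter(entity_terms)
    dlen = len(entity_terms)
    return sum(math.log((tf[t] + mu * collection_lm.get(t, 1e-6)) / (dlen + mu))
               for t in query_terms)

def annotation_score(query_annotations, entity_id):
    # New probabilistic component: evidence that the query was linked to
    # this entity, weighted by the entity linker's confidence.
    return sum(conf for e, conf in query_annotations if e == entity_id)

def combined_score(query_terms, query_annotations, entity_id, entity_terms,
                   collection_lm, lam=0.8):
    # Mix the term-based score with the annotation-based component.
    return (lam * term_score(query_terms, entity_terms, collection_lm)
            + (1 - lam) * annotation_score(query_annotations, entity_id))
```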
{"title":"Exploiting Entity Linking in Queries for Entity Retrieval","authors":"Faegheh Hasibi, K. Balog, Svein Erik Bratsberg","doi":"10.1145/2970398.2970406","DOIUrl":"https://doi.org/10.1145/2970398.2970406","url":null,"abstract":"The premise of entity retrieval is to better answer search queries by returning specific entities instead of documents. Many queries mention particular entities; recognizing and linking them to the corresponding entry in a knowledge base is known as the task of entity linking in queries. In this paper we make a first attempt at bringing together these two, i.e., leveraging entity annotations of queries in the entity retrieval model. We introduce a new probabilistic component and show how it can be applied on top of any term-based entity retrieval model that can be emulated in the Markov Random Field framework, including language models, sequential dependence models, as well as their fielded variations. Using a standard entity retrieval test collection, we show that our extension brings consistent improvements over all baseline methods, including the current state-of-the-art. We further show that our extension is robust against parameter settings.","PeriodicalId":443715,"journal":{"name":"Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129450631","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 82
A Reproducibility Study of Information Retrieval Models
Peilin Yang, Hui Fang
Developing effective information retrieval models has been a long-standing challenge in Information Retrieval (IR), and significant progress has been made over the years. With the increasing number of developed retrieval functions and the release of new data collections, it becomes more difficult, if not impossible, to compare a new retrieval function with all existing retrieval functions over all available data collections. To tackle this problem, this paper describes our efforts on constructing a platform that aims to improve the reproducibility of IR research and facilitate the evaluation and comparison of retrieval functions. With the developed platform, more than 20 state-of-the-art retrieval functions have been implemented and systematically evaluated over 16 standard TREC collections (including the newly released ClueWeb datasets). Our reproducibility study leads to several interesting observations. First, the performance difference between the reproduced results and those reported in the original papers is small for most retrieval functions. Second, the optimal performance of a few representative retrieval functions is still comparable over the new TREC ClueWeb collections. Finally, the developed platform (i.e., RISE) is made publicly available so that any IR researcher is able to utilize it to evaluate other retrieval functions.
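The abstract describes a platform that runs many retrieval functions under identical conditions. Below is a hedged sketch of what such a plugin-style registry might look like, with BM25 as one registered function; the registry, decorator, and `stats` dictionary layout are our assumptions, not RISE's actual API.

```python
import math
from collections import Counter

RETRIEVAL_FUNCTIONS = {}  # name -> scoring function

def register(name):
    def wrap(fn):
        RETRIEVAL_FUNCTIONS[name] = fn
        return fn
    return wrap

@register("bm25")
def bm25(query, doc, stats, k1=1.2, b=0.75):
    # stats: {"N": num docs, "avgdl": avg doc length, "df": term -> doc freq}
    tf = Counter(doc)
    score = 0.0
    for t in set(query):
        df = stats["df"].get(t, 0)
        if df == 0:
            continue
        idf = math.log(1 + (stats["N"] - df + 0.5) / (df + 0.5))
        norm = tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(doc) / stats["avgdl"]))
        score += idf * norm
    return score

def evaluate_all(metric, queries, docs, stats):
    # Run every registered function over the same collection so the
    # resulting numbers are directly comparable.
    return {name: metric(fn, queries, docs, stats)
            for name, fn in RETRIEVAL_FUNCTIONS.items()}
```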
{"title":"A Reproducibility Study of Information Retrieval Models","authors":"Peilin Yang, Hui Fang","doi":"10.1145/2970398.2970415","DOIUrl":"https://doi.org/10.1145/2970398.2970415","url":null,"abstract":"Developing effective information retrieval models has been a long standing challenge in Information Retrieval (IR), and significant progresses have been made over the years. With the increasing number of developed retrieval functions and the release of new data collections, it becomes more difficult, if not impossible, to compare a new retrieval function with all existing retrieval functions over all available data collections. To tackle thisproblem, this paper describes our efforts on constructing a platform that aims to improve the reproducibility of IR researchand facilitate the evaluation and comparison of retrieval functions. With the developed platform, more than 20 state of the art retrieval functions have been implemented and systematically evaluated over 16 standard TREC collections (including the newly released ClueWeb datasets). Our reproducibility study leads to several interesting observations. First, the performance difference between the reproduced results and those reported in the original papers is small for most retrieval functions. Second, the optimal performance of a few representative retrieval functions is still comparable over the new TREC ClueWeb collections. Finally, the developed platform (i.e., RISE) is made publicly available so that any IR researchers would be able to utilize it to evaluate other retrieval functions.","PeriodicalId":443715,"journal":{"name":"Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126569781","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 18
A Study of Document Expansion using Translation Models and Dimensionality Reduction Methods
Saeid Balaneshinkordan, Alexander Kotov
Over a decade of research on document expansion methods has resulted in several independent avenues, including smoothing methods, translation models, and dimensionality reduction techniques such as matrix decompositions and topic models. Although these research avenues have been individually explored in many previous studies, there is still a lack of understanding of how the state-of-the-art methods in each of these directions compare with each other in terms of retrieval accuracy. This paper eliminates this gap by reporting the results of an empirical comparison, on standard TREC collections, of document expansion methods using translation models estimated from word co-occurrence and from cosine similarity between low-dimensional word embeddings, Latent Dirichlet Allocation (LDA), and Non-negative Matrix Factorization (NMF). Experimental results indicate that LDA-based document expansion consistently outperforms both types of translation models and NMF according to all evaluation metrics for all and for difficult queries, and is closely followed by the translation model using word embeddings.
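A hedged sketch of LDA-based document expansion under the usual mixture formulation p(w|d) = λ·p_ml(w|d) + (1−λ)·Σ_z p(w|z)p(z|d), using scikit-learn's LDA; the λ value, topic count, and choice of scikit-learn are our assumptions, not the paper's experimental setup.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def lda_expanded_lms(docs, n_topics=50, lam=0.7):
    # p(w|d) = lam * p_ml(w|d) + (1 - lam) * sum_z p(w|z) p(z|d)
    vec = CountVectorizer()
    X = vec.fit_transform(docs)                        # doc-term counts
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0).fit(X)
    topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
    doc_topic = lda.transform(X)                       # p(z|d), rows sum to 1
    counts = np.asarray(X.todense(), dtype=float)
    p_ml = counts / counts.sum(axis=1, keepdims=True)  # maximum-likelihood doc LM
    return lam * p_ml + (1 - lam) * (doc_topic @ topic_word)
```

A translation-model variant would instead smooth p_ml with a term-to-term translation matrix built from co-occurrence statistics or embedding cosine similarities.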
{"title":"A Study of Document Expansion using Translation Models and Dimensionality Reduction Methods","authors":"Saeid Balaneshinkordan, Alexander Kotov","doi":"10.1145/2970398.2970439","DOIUrl":"https://doi.org/10.1145/2970398.2970439","url":null,"abstract":"Over a decade of research on document expansion methods resulted in several independent avenues, including smoothing methods, translation models, and dimensionality reduction techniques, such as matrix decompositions and topic models. Although these research avenues have been individually explored in many previous studies, there is still a lack of understanding of how state-of-the-art methods for each of these directions compare with each other in terms of retrieval accuracy. This paper eliminates this gap by reporting the results of an empirical comparison of document expansion methods using translation models estimated based on word co-occurrence and cosine similarity between low-dimensional word embeddings, Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF), on standard TREC collections. Experimental results indicate that LDA-based document expansion consistently outperforms both types of translation models and NMF according to all evaluation metrics for all and difficult queries, which is closely followed by translation model using word embeddings.","PeriodicalId":443715,"journal":{"name":"Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116858507","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Nearest Neighbour based Transformation Functions for Text Classification: A Case Study with StackOverflow
Piyush Arora, Debasis Ganguly, G. Jones
A significant increase in the number of questions in question answering forums has led to interest in text categorization methods for classifying a newly posted question as good (suitable) or bad (otherwise) for the forum. Standard text categorization approaches, e.g. multinomial Naive Bayes, are likely to be unsuitable for this classification task because of: i) the lack of sufficient informative content in the questions due to their relatively short length; and ii) considerable vocabulary overlap between the classes. To increase the robustness of this classification task, we propose to use the neighbourhood of existing questions which are similar to the newly asked question. Instead of learning the classification boundary from the questions alone, we transform each question vector into a different one in the feature space. We explore two different neighbourhood functions: one using the discrete term space, and one using the continuous vector space of real numbers obtained from vector embeddings of documents. Experiments conducted on StackOverflow data show that our approach of using the neighbourhood transformation can improve classification accuracy by up to about 8%.
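A hedged sketch of the continuous-space variant of the idea: each question vector is mixed with the centroid of its k nearest training questions before classification. The mixing weight `alpha`, the value of k, and the cosine metric are illustrative assumptions, not the paper's exact transformation functions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def neighbourhood_transform(X_train, X, k=5, alpha=0.5):
    # Mix each question vector with the centroid of its k nearest
    # training questions, so short texts borrow evidence from similar
    # previously posted questions.
    nn = NearestNeighbors(n_neighbors=k, metric="cosine").fit(X_train)
    _, idx = nn.kneighbors(X)
    centroids = X_train[idx].mean(axis=1)   # (n, d) neighbour centroids
    return (1 - alpha) * X + alpha * centroids
```

The transformed vectors would then feed any standard classifier, e.g. multinomial Naive Bayes or logistic regression, in place of the raw question vectors.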
{"title":"Nearest Neighbour based Transformation Functions for Text Classification: A Case Study with StackOverflow","authors":"Piyush Arora, Debasis Ganguly, G. Jones","doi":"10.1145/2970398.2970426","DOIUrl":"https://doi.org/10.1145/2970398.2970426","url":null,"abstract":"significant increase in the number of questions in question answering forums has led to the interest in text categorization methods for classifying a newly posted question as good (suitable) or bad (otherwise) for the forum. Standard text categorization approaches, e.g. multinomial Naive Bayes, are likely to be unsuitable for this classification task because of: i) the lack of sufficient informative content in the questions due to their relatively short length; and ii) considerable vocabulary overlap between the classes. To increase the robustness of this classification task, we propose to use the neighbourhood of existing questions which are similar to the newly asked question. Instead of learning the classification boundary from the questions alone, we transform each question vector into a different one in the feature space. We explore two different neighbourhood functions using: the discrete term space, the continuous vector space of real numbers obtained from vector embeddings of documents. Experiments conducted on StackOverflow data show that our approach of using the neighborhood transformation can improve classification accuracy by up to about 8%.","PeriodicalId":443715,"journal":{"name":"Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval","volume":"98 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128383893","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
A Utility Maximization Framework for Privacy Preservation of User Generated Content
Yi Fang, Archana Godavarthy, Haibing Lu
The prodigious amount of user-generated content continues to grow at an enormous rate. While it greatly facilitates the flow of information and ideas among people and communities, it may pose a great threat to our individual privacy. In this paper, we demonstrate that the private traits of individuals can be inferred from user-generated content by using text classification techniques. Specifically, we study three private attributes of Twitter users: religion, political leaning, and marital status. The ground truth labels of the private traits can be readily collected from the Twitter bio field. Based on the tweets posted by the users and their corresponding bios, we show that text classification identifies these personal attributes with high accuracy, which poses a great privacy risk for user-generated content. We further propose a constrained utility maximization framework for preserving user privacy. The goal is to maximize the utility of data when modifying the user-generated content, while degrading the prediction performance of the adversary. We minimize the KL divergence between the prior knowledge about the private attribute and the posterior probability after seeing the user-generated data. Based on this proposed framework, we investigate several specific data sanitization operations for privacy preservation: adding, deleting, or replacing words in the tweets. We derive the exact transformation of the data under each operation. The experiments demonstrate the effectiveness of the proposed framework.
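A hedged sketch of the "delete" sanitization operation under the stated objective: greedily remove the word whose deletion brings the adversary's posterior over the private attribute closest to the prior. `posterior_fn` stands in for any adversary model (e.g., a trained classifier's predicted class distribution) and is an assumption of ours, as is the greedy search itself.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    # KL divergence between two discrete distributions.
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log(p / q)))

def sanitize_by_deletion(words, prior, posterior_fn, max_edits=3):
    # Greedily delete the word whose removal brings the adversary's
    # posterior over the private attribute closest to the prior.
    words = list(words)
    for _ in range(max_edits):
        if not words:
            break
        current = kl(prior, posterior_fn(words))
        candidates = [(kl(prior, posterior_fn(words[:i] + words[i + 1:])), i)
                      for i in range(len(words))]
        best_kl, best_i = min(candidates)
        if best_kl >= current:
            break                     # no single deletion reduces leakage
        del words[best_i]
    return words
```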
{"title":"A Utility Maximization Framework for Privacy Preservation of User Generated Content","authors":"Yi Fang, Archana Godavarthy, Haibing Lu","doi":"10.1145/2970398.2970417","DOIUrl":"https://doi.org/10.1145/2970398.2970417","url":null,"abstract":"The prodigious amount of user-generated content continues to grow at an enormous rate. While it greatly facilitates the flow of information and ideas among people and communities, it may pose great threat to our individual privacy. In this paper, we demonstrate that the private traits of individuals can be inferred from user-generated content by using text classification techniques. Specifically, we study three private attributes on Twitter users: religion, political leaning, and marital status. The ground truth labels of the private traits can be readily collected from the Twitter bio field. Based on the tweets posted by the users and their corresponding bios, we show that text classification yields a high accuracy of identification of these personal attributes, which poses a great privacy risk on user-generated content. We further propose a constrained utility maximization framework for preserving user privacy. The goal is to maximize the utility of data when modifying the user-generated content, while degrading the prediction performance of the adversary. The KL divergence is minimized between the prior knowledge about the private attribute and the posterior probability after seeing the user-generated data. Based on this proposed framework, we investigate several specific data sanitization operations for privacy preservation: add, delete, or replace words in the tweets. We derive the exact transformation of the data under each operation. The experiments demonstrate the effectiveness of the proposed framework.","PeriodicalId":443715,"journal":{"name":"Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134371509","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7
Utilizing Knowledge Bases in Text-centric Information Retrieval
Laura Dietz, Alexander Kotov, E. Meij
General-purpose knowledge bases are increasingly growing in terms of depth (content) and width (coverage). Moreover, algorithms for entity linking and entity retrieval have improved tremendously in the past years. These developments give rise to a new line of research that exploits and combines them for the purposes of text-centric information retrieval applications. This tutorial focuses on a) how to retrieve a set of entities for an ad-hoc query, or more broadly, how to assess the relevance of KB elements for the information need, b) how to annotate text with such elements, and c) how to use this information to assess the relevance of text. We discuss different kinds of information available in a knowledge graph and how to leverage each most effectively. We start the tutorial with a brief overview of different types of knowledge bases, their structure, and the information contained in popular general-purpose and domain-specific knowledge bases. In particular, we focus on the representation of entity-centric information in the knowledge base through names, terms, relations, and type taxonomies. Next, we provide a recap of ad-hoc object retrieval from knowledge graphs as well as entity linking and retrieval; this is essential technology on which the remainder of the tutorial builds. We then cover essential components of successful entity linking systems, including the collection of entity name information and techniques for disambiguating contextual entity mentions. We present the details of four previously proposed systems that successfully leverage knowledge bases to improve ad-hoc document retrieval; these systems combine the notion of entity retrieval and semantic search on one hand with text retrieval models and entity linking on the other. Finally, we also touch on entity aspects and links in the knowledge graph, as they can help in understanding an entity's context. This tutorial is the first to compile, summarize, and disseminate progress in this emerging area, and we provide both an overview of state-of-the-art methods and an outline of open research problems to encourage new contributions.
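As a toy illustration of the annotate-then-score pipeline the tutorial covers, here is a dictionary-based entity linker plus an entity-overlap document boost; the `KB_NAMES` entries, `beta`, and the additive boost are hypothetical, not a method taught in the tutorial.

```python
# Hypothetical KB entries: surface name -> KB identifier.
KB_NAMES = {"barack obama": "Q76", "nobel prize": "Q7191"}

def annotate(text):
    # Toy dictionary-based entity linking: exact name matching, no
    # disambiguation (real linkers score candidate entities in context).
    lowered = text.lower()
    return [kb_id for name, kb_id in KB_NAMES.items() if name in lowered]

def entity_boosted_score(base_score, query_entities, doc_entities, beta=0.3):
    # Boost documents that mention entities linked in the query.
    overlap = len(set(query_entities) & set(doc_entities))
    return base_score + beta * overlap
```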
{"title":"Utilizing Knowledge Bases in Text-centric Information Retrieval","authors":"Laura Dietz, Alexander Kotov, E. Meij","doi":"10.1145/2970398.2970441","DOIUrl":"https://doi.org/10.1145/2970398.2970441","url":null,"abstract":"General-purpose knowledge bases are increasingly growing in terms of depth (content) and width (coverage). Moreover, algorithms for entity linking and entity retrieval have improved tremendously in the past years. These developments give rise to a new line of research that exploits and combines these developments for the purposes of text-centric information retrieval applications. This tutorial focuses on a) how to retrieve a set of entities for an ad-hoc query, or more broadly, assessing relevance of KB elements for the information need, b) how to annotate text with such elements, and c) how to use this information to assess the relevance of text. We discuss different kinds of information available in a knowledge graph and how to leverage each most effectively. We start the tutorial with a brief overview of different types of knowledge bases, their structure and information contained in popular general-purpose and domain-specific knowledge bases. In particular, we focus on the representation of entity-centric information in the knowledge base through names, terms, relations, and type taxonomies. Next, we will provide a recap on ad-hoc object retrieval from knowledge graphs as well as entity linking and retrieval. This is essential technology, which the remainder of the tutorial builds on. Next we will cover essential components within successful entity linking systems, including the collection of entity name information and techniques for disambiguation with contextual entity mentions. We will present the details of four previously proposed systems that successfully leverage knowledge bases to improve ad-hoc document retrieval. These systems combine the notion of entity retrieval and semantic search on one hand, with text retrieval models and entity linking on the other. Finally, we also touch on entity aspects and links in the knowledge graph as it can help to understand the entities' context. This tutorial is the first to compile, summarize, and disseminate progress in this emerging area and we provide both an overview of state-of-the-art methods and outline open research problems to encourage new contributions.","PeriodicalId":443715,"journal":{"name":"Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116122292","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
Learning to Rank with Labeled Features
Fernando Diaz
Classic learning to rank algorithms are trained using a set of labeled documents, pairs of documents, or rankings of documents. Unfortunately, in many situations, gathering such labels requires significant overhead in terms of time and money. We present an algorithm for training a learning to rank model using a set of labeled features elicited from system designers or domain experts. Labeled features incorporate a system designer's belief about the correlation between certain features and relative relevance. We demonstrate the efficacy of our model on a public learning to rank dataset. Our results show that we outperform our baselines even when using only a single feature label.
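This is not the paper's algorithm, but one simple way to encode designer beliefs about features is to regularize a pairwise logistic ranker toward a prior weight vector `w_prior` built from the feature labels; all names and the regularization scheme here are our assumptions.

```python
import numpy as np

def train_with_feature_labels(X_pairs, w_prior, lr=0.1, reg=1.0, epochs=200):
    # X_pairs: rows are feature differences x_relevant - x_nonrelevant.
    # Pairwise logistic loss, with weights pulled toward w_prior, the
    # designer-supplied vector encoding the labeled features.
    w = np.asarray(w_prior, dtype=float).copy()
    for _ in range(epochs):
        margins = X_pairs @ w
        sig = 1.0 / (1.0 + np.exp(-margins))
        grad = -(X_pairs * (1.0 - sig)[:, None]).mean(axis=0)
        grad += reg * (w - w_prior)   # regularize toward the prior beliefs
        w -= lr * grad
    return w
```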
{"title":"Learning to Rank with Labeled Features","authors":"Fernando Diaz","doi":"10.1145/2970398.2970435","DOIUrl":"https://doi.org/10.1145/2970398.2970435","url":null,"abstract":"Classic learning to rank algorithms are trained using a set of labeled documents, pairs of documents, or rankings of documents. Unfortunately, in many situations, gathering such labels requires significant overhead in terms of time and money. We present an algorithm for training a learning to rank model using a set of labeled features elicited from system designers or domain experts. Labeled features incorporate a system designer's belief about the correlation between certain features and relative relevance. We demonstrate the efficacy of our model on a public learning to rank dataset. Our results show that we outperform our baselines even when using as little as a single feature label.","PeriodicalId":443715,"journal":{"name":"Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval","volume":"90 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121754700","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 10
EventMiner: Mining Events from Annotated Documents
Dhruv Gupta, Jannik Strotgen, K. Berberich
Events are central in human history and thus also in Web queries, in particular if they relate to history or news. However, ambiguity issues arise as queries may refer to ambiguous events differing in time, geography, or participating entities. Thus, users would greatly benefit if search results were presented along different events. In this paper, we present EventMiner, an algorithm that mines events from top-k pseudo-relevant documents for a given query. It is a probabilistic framework that leverages semantic annotations in the form of temporal expressions, geographic locations, and named entities to analyze natural language text and determine important events. Using a large news corpus, we show that using semantic annotations, EventMiner detects important events and presents documents covering the identified events in the order of their importance.
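A toy stand-in for the probabilistic framework described above: score candidate events as (time, location, entity) triples by their co-occurrence frequency in the top-k pseudo-relevant documents. The dictionary layout of the annotated documents is an assumed input format, not EventMiner's actual data model.

```python
from collections import Counter

def mine_events(annotated_docs, top_n=5):
    # Count co-occurring (time, location, entity) triples across the
    # pseudo-relevant documents and rank them by frequency.
    counts = Counter()
    for doc in annotated_docs:
        for time in doc["temporal"]:
            for loc in doc["locations"]:
                for ent in doc["entities"]:
                    counts[(time, loc, ent)] += 1
    return counts.most_common(top_n)

docs = [{"temporal": ["1969"], "locations": ["Moon"], "entities": ["Apollo 11"]},
        {"temporal": ["1969"], "locations": ["Moon"], "entities": ["Apollo 11", "NASA"]}]
print(mine_events(docs))  # [(('1969', 'Moon', 'Apollo 11'), 2), ...]
```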
{"title":"EventMiner: Mining Events from Annotated Documents","authors":"Dhruv Gupta, Jannik Strotgen, K. Berberich","doi":"10.1145/2970398.2970411","DOIUrl":"https://doi.org/10.1145/2970398.2970411","url":null,"abstract":"Events are central in human history and thus also in Web queries, in particular if they relate to history or news. However, ambiguity issues arise as queries may refer to ambiguous events differing in time, geography, or participating entities. Thus, users would greatly benefit if search results were presented along different events. In this paper, we present EventMiner, an algorithm that mines events from top-k pseudo-relevant documents for a given query. It is a probabilistic framework that leverages semantic annotations in the form of temporal expressions, geographic locations, and named entities to analyze natural language text and determine important events. Using a large news corpus, we show that using semantic annotations, EventMiner detects important events and presents documents covering the identified events in the order of their importance.","PeriodicalId":443715,"journal":{"name":"Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval","volume":"94 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127068090","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 9
Analysis of the Paragraph Vector Model for Information Retrieval
Qingyao Ai, Liu Yang, Jiafeng Guo, W. Bruce Croft
Previous studies have shown that semantically meaningful representations of words and text can be acquired through neural embedding models. In particular, paragraph vector (PV) models have shown impressive performance in some natural language processing tasks by estimating a document (topic) level language model. Integrating the PV models with traditional language model approaches to retrieval, however, produces unstable performance and limited improvements. In this paper, we formally discuss three intrinsic problems of the original PV model that restrict its performance in retrieval tasks. We also describe modifications to the model that make it more suitable for the IR task, and show their impact through experiments and case studies. The three issues we address are (1) the unregulated training process of PV is vulnerable to short document over-fitting that produces length bias in the final retrieval model; (2) the corpus-based negative sampling of PV leads to a weighting scheme for words that overly suppresses the importance of frequent words; and (3) the lack of word-context information makes PV unable to capture word substitution relationships.
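A hedged sketch of how a PV-style document language model can be plugged into query-likelihood retrieval, interpolated with the maximum-likelihood model; interpolation is one simple damper for the short-document over-fitting issue the paper analyzes, and `lam` plus the softmax construction are our assumptions rather than the paper's exact modifications.

```python
import numpy as np

def pv_language_model(doc_vec, word_vecs):
    # p(w|d) proportional to exp(w . d): the document-level LM induced
    # by a paragraph vector and the output word embeddings.
    logits = word_vecs @ doc_vec
    probs = np.exp(logits - logits.max())   # numerically stable softmax
    return probs / probs.sum()

def retrieval_score(query_ids, doc_vec, word_vecs, p_ml, lam=0.5):
    # Query likelihood under a mixture of the maximum-likelihood document
    # LM and the PV-induced LM.
    p = lam * p_ml + (1 - lam) * pv_language_model(doc_vec, word_vecs)
    return float(np.sum(np.log(p[query_ids] + 1e-12)))
```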
{"title":"Analysis of the Paragraph Vector Model for Information Retrieval","authors":"Qingyao Ai, Liu Yang, Jiafeng Guo, W. Bruce Croft","doi":"10.1145/2970398.2970409","DOIUrl":"https://doi.org/10.1145/2970398.2970409","url":null,"abstract":"Previous studies have shown that semantically meaningful representations of words and text can be acquired through neural embedding models. In particular, paragraph vector (PV) models have shown impressive performance in some natural language processing tasks by estimating a document (topic) level language model. Integrating the PV models with traditional language model approaches to retrieval, however, produces unstable performance and limited improvements. In this paper, we formally discuss three intrinsic problems of the original PV model that restrict its performance in retrieval tasks. We also describe modifications to the model that make it more suitable for the IR task, and show their impact through experiments and case studies. The three issues we address are (1) the unregulated training process of PV is vulnerable to short document over-fitting that produces length bias in the final retrieval model; (2) the corpus-based negative sampling of PV leads to a weighting scheme for words that overly suppresses the importance of frequent words; and (3) the lack of word-context information makes PV unable to capture word substitution relationships.","PeriodicalId":443715,"journal":{"name":"Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123629949","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 79
Embedding-based Query Language Models
Hamed Zamani, W. Bruce Croft
Word embeddings, which are low-dimensional vector representations of vocabulary terms that capture the semantic similarity between them, have recently been shown to achieve impressive performance in many natural language processing tasks. The use of word embeddings in information retrieval, however, has only begun to be studied. In this paper, we explore the use of word embeddings to enhance the accuracy of query language models in the ad-hoc retrieval task. To this end, we propose to use word embeddings to incorporate and weight terms that do not occur in the query, but are semantically related to the query terms. We describe two embedding-based query expansion models with different assumptions. Since pseudo-relevance feedback methods that use the top retrieved documents to update the original query model are well-known to be effective, we also develop an embedding-based relevance model, an extension of the effective and robust relevance model approach. In these models, we transform the similarity values obtained by the widely-used cosine similarity with a sigmoid function to have more discriminative semantic similarity values. We evaluate our proposed methods using three TREC newswire and web collections. The experimental results demonstrate that the embedding-based methods significantly outperform competitive baselines in most cases. The embedding-based methods are also shown to be more robust than the baselines.
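A hedged sketch of the expansion idea: weight vocabulary terms by a sigmoid-transformed cosine similarity to the centroid of the query's word embeddings, then form an expanded query language model. The sigmoid parameters `a` and `c`, the `n_expand` cutoff, and the centroid choice are illustrative assumptions, not the paper's tuned configuration.

```python
import numpy as np

def sigmoid(x, a=10.0, c=0.5):
    # Sharpen raw cosine similarities into more discriminative values.
    return 1.0 / (1.0 + np.exp(-a * (x - c)))

def embedding_query_lm(query_ids, emb, n_expand=20):
    # Normalize embeddings so dot products are cosine similarities.
    emb_norm = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    q_vec = emb_norm[query_ids].mean(axis=0)
    q_vec /= np.linalg.norm(q_vec)
    sims = sigmoid(emb_norm @ q_vec)      # one similarity per vocab term
    top = np.argsort(-sims)[:n_expand]    # best expansion candidates
    p = np.zeros(emb.shape[0])
    p[top] = sims[top]
    p[query_ids] += 1.0                   # keep original query terms dominant
    return p / p.sum()
```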
{"title":"Embedding-based Query Language Models","authors":"Hamed Zamani, W. Bruce Croft","doi":"10.1145/2970398.2970405","DOIUrl":"https://doi.org/10.1145/2970398.2970405","url":null,"abstract":"Word embeddings, which are low-dimensional vector representations of vocabulary terms that capture the semantic similarity between them, have recently been shown to achieve impressive performance in many natural language processing tasks. The use of word embeddings in information retrieval, however, has only begun to be studied. In this paper, we explore the use of word embeddings to enhance the accuracy of query language models in the ad-hoc retrieval task. To this end, we propose to use word embeddings to incorporate and weight terms that do not occur in the query, but are semantically related to the query terms. We describe two embedding-based query expansion models with different assumptions. Since pseudo-relevance feedback methods that use the top retrieved documents to update the original query model are well-known to be effective, we also develop an embedding-based relevance model, an extension of the effective and robust relevance model approach. In these models, we transform the similarity values obtained by the widely-used cosine similarity with a sigmoid function to have more discriminative semantic similarity values. We evaluate our proposed methods using three TREC newswire and web collections. The experimental results demonstrate that the embedding-based methods significantly outperform competitive baselines in most cases. The embedding-based methods are also shown to be more robust than the baselines.","PeriodicalId":443715,"journal":{"name":"Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115770565","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 130