Embedding-based Query Language Models

Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval. Pub Date: 2016-09-12. DOI: 10.1145/2970398.2970405

Hamed Zamani, W. Bruce Croft

Citations: 130

Abstract

Word embeddings, which are low-dimensional vector representations of vocabulary terms that capture the semantic similarity between them, have recently been shown to achieve impressive performance in many natural language processing tasks. The use of word embeddings in information retrieval, however, has only begun to be studied. In this paper, we explore the use of word embeddings to enhance the accuracy of query language models in the ad-hoc retrieval task. To this end, we propose to use word embeddings to incorporate and weight terms that do not occur in the query, but are semantically related to the query terms. We describe two embedding-based query expansion models with different assumptions. Since pseudo-relevance feedback methods that use the top retrieved documents to update the original query model are well-known to be effective, we also develop an embedding-based relevance model, an extension of the effective and robust relevance model approach. In these models, we transform the similarity values obtained by the widely-used cosine similarity with a sigmoid function to have more discriminative semantic similarity values. We evaluate our proposed methods using three TREC newswire and web collections. The experimental results demonstrate that the embedding-based methods significantly outperform competitive baselines in most cases. The embedding-based methods are also shown to be more robust than the baselines.
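As an illustration only, the sigmoid transformation of cosine similarity and a simple expansion of the query model can be sketched in Python. The abstract does not give the formulas, so the function names, the parameters `a` (steepness) and `c` (midpoint), the averaging rule, the interpolation weight `alpha`, and the toy 2-d embeddings are all assumptions made for this sketch. It is not either of the paper's two expansion models.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def sigmoid_similarity(u, v, a=10.0, c=0.8):
    """Cosine similarity passed through a sigmoid.

    a (steepness) and c (midpoint) are hypothetical values chosen
    for illustration, not values taken from the paper.
    """
    return 1.0 / (1.0 + math.exp(-a * (cosine(u, v) - c)))


def expansion_model(query_terms, embeddings, a=10.0, c=0.8):
    """Weight each vocabulary term by its mean transformed similarity
    to the query terms, then normalize the weights to sum to 1.

    The averaging rule is an assumption made for this sketch.
    """
    scores = {}
    for term, vec in embeddings.items():
        sims = [sigmoid_similarity(vec, embeddings[q], a, c) for q in query_terms]
        scores[term] = sum(sims) / len(sims)
    total = sum(scores.values())
    return {term: s / total for term, s in scores.items()}


def interpolate(original, expanded, alpha=0.5):
    """Linear interpolation of the original and expanded query models."""
    terms = set(original) | set(expanded)
    return {
        t: alpha * original.get(t, 0.0) + (1.0 - alpha) * expanded.get(t, 0.0)
        for t in terms
    }


# Toy 2-d embeddings (made up for the example).
embeddings = {
    "car": [1.0, 0.1],
    "automobile": [0.9, 0.2],
    "banana": [0.1, 1.0],
}

query = ["car"]
original = {"car": 1.0}  # maximum-likelihood model of a one-term query
expanded = expansion_model(query, embeddings)
model = interpolate(original, expanded, alpha=0.5)
```

With these toy vectors, `automobile` receives far more weight than `banana`, because the sigmoid pushes similarities below the midpoint `c` toward zero while keeping similarities above it close to one.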
