排列索引:从大型语料库快速近似检索

Proceedings of the 22nd ACM international conference on Information & Knowledge Management Pub Date : 2013-10-27 DOI:10.1145/2505515.2505646

M. Gurevich, Tamás Sarlós

{"title":"排列索引:从大型语料库快速近似检索","authors":"M. Gurevich, Tamás Sarlós","doi":"10.1145/2505515.2505646","DOIUrl":null,"url":null,"abstract":"Inverted indexing is a ubiquitous technique used in retrieval systems including web search. Despite its popularity, it has a drawback - query retrieval time is highly variable and grows with the corpus size. In this work we propose an alternative technique, permutation indexing, where retrieval cost is strictly bounded and has only logarithmic dependence on the corpus size. Our approach is based on two novel techniques: (a) partitioning of the term space into overlapping clusters of terms that frequently co-occur in queries, and (b) a data structure for compactly encoding results of all queries composed of terms in a cluster as continuous sequences of document ids. Then, query results are retrieved by fetching few small chunks of these sequences. There is a price though: our encoding is lossy and thus returns approximate result sets. The fraction of the true results returned, recall, is controlled by the level of redundancy. The more space is allocated for the permutation index the higher is the recall. We analyze permutation indexing both theoretically under simplified document and query models, and empirically on a realistic document and query collections. We show that although permutation indexing can not replace traditional retrieval methods, since high recall cannot be guaranteed on all queries, it covers up to 77% of tail queries and can be used to speed up retrieval for these queries.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"34 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Permutation indexing: fast approximate retrieval from large corpora\",\"authors\":\"M. Gurevich, Tamás Sarlós\",\"doi\":\"10.1145/2505515.2505646\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Inverted indexing is a ubiquitous technique used in retrieval systems including web search. Despite its popularity, it has a drawback - query retrieval time is highly variable and grows with the corpus size. In this work we propose an alternative technique, permutation indexing, where retrieval cost is strictly bounded and has only logarithmic dependence on the corpus size. Our approach is based on two novel techniques: (a) partitioning of the term space into overlapping clusters of terms that frequently co-occur in queries, and (b) a data structure for compactly encoding results of all queries composed of terms in a cluster as continuous sequences of document ids. Then, query results are retrieved by fetching few small chunks of these sequences. There is a price though: our encoding is lossy and thus returns approximate result sets. The fraction of the true results returned, recall, is controlled by the level of redundancy. The more space is allocated for the permutation index the higher is the recall. We analyze permutation indexing both theoretically under simplified document and query models, and empirically on a realistic document and query collections. We show that although permutation indexing can not replace traditional retrieval methods, since high recall cannot be guaranteed on all queries, it covers up to 77% of tail queries and can be used to speed up retrieval for these queries.\",\"PeriodicalId\":20528,\"journal\":{\"name\":\"Proceedings of the 22nd ACM international conference on Information & Knowledge Management\",\"volume\":\"34 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2013-10-27\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 22nd ACM international conference on Information & Knowledge Management\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2505515.2505646\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2505515.2505646","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

倒排索引在包括网络搜索在内的检索系统中是一种普遍使用的技术。尽管它很受欢迎，但它有一个缺点，即查询检索时间是高度可变的，并且随着语料库的大小而增长。在这项工作中，我们提出了一种替代技术，排列索引，其中检索成本是严格限定的，并且仅对语料库大小有对数依赖。我们的方法基于两种新技术:(a)将术语空间划分为查询中经常出现的重叠术语簇，以及(b)将集群中由术语组成的所有查询的结果紧凑编码为连续的文档id序列的数据结构。然后，通过获取这些序列的一小部分来检索查询结果。这是有代价的:我们的编码是有损的，因此返回近似的结果集。返回的真实结果的比例，召回率，是由冗余级别控制的。为排列索引分配的空间越多，召回率就越高。我们在理论上分析了简化文档和查询模型下的排列索引，并在实际文档和查询集合上进行了经验分析。我们表明，虽然排列索引不能取代传统的检索方法，因为不能保证对所有查询都有高召回率，但它覆盖了高达77%的尾部查询，可以用来加快这些查询的检索速度。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Permutation indexing: fast approximate retrieval from large corpora

Inverted indexing is a ubiquitous technique used in retrieval systems including web search. Despite its popularity, it has a drawback - query retrieval time is highly variable and grows with the corpus size. In this work we propose an alternative technique, permutation indexing, where retrieval cost is strictly bounded and has only logarithmic dependence on the corpus size. Our approach is based on two novel techniques: (a) partitioning of the term space into overlapping clusters of terms that frequently co-occur in queries, and (b) a data structure for compactly encoding results of all queries composed of terms in a cluster as continuous sequences of document ids. Then, query results are retrieved by fetching few small chunks of these sequences. There is a price though: our encoding is lossy and thus returns approximate result sets. The fraction of the true results returned, recall, is controlled by the level of redundancy. The more space is allocated for the permutation index the higher is the recall. We analyze permutation indexing both theoretically under simplified document and query models, and empirically on a realistic document and query collections. We show that although permutation indexing can not replace traditional retrieval methods, since high recall cannot be guaranteed on all queries, it covers up to 77% of tail queries and can be used to speed up retrieval for these queries.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助