{"title":"On Divergence Measures and Static Index Pruning","authors":"Ruey-Cheng Chen, Chia-Jung Lee, W. Bruce Croft","doi":"10.1145/2808194.2809472","DOIUrl":null,"url":null,"abstract":"We study the problem of static index pruning in a renowned divergence minimization framework, using a range of divergence measures such as f-divergence and Rényi divergence as the objective. We show that many well-known divergence measures are convex in pruning decisions, and therefore can be exactly minimized using an efficient algorithm. Our approach allows postings be prioritized according to the amount of information they contribute to the index, and through specifying a different divergence measure the contribution is modeled on a different returns curve. In our experiment on GOV2 data, Rényi divergence of order infinity appears the most effective. This divergence measure significantly outperforms many standard methods and achieves identical retrieval effectiveness as full data using only 50% of the postings. When top-k precision is of the only concern, 10% of the data is sufficient to achieve the accuracy that one would usually expect from a full index.","PeriodicalId":440325,"journal":{"name":"Proceedings of the 2015 International Conference on The Theory of Information Retrieval","volume":"10 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2015 International Conference on The Theory of Information Retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2808194.2809472","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 6
Abstract
We study the problem of static index pruning in a well-known divergence minimization framework, using a range of divergence measures, such as f-divergence and Rényi divergence, as the objective. We show that many well-known divergence measures are convex in the pruning decisions and can therefore be exactly minimized using an efficient algorithm. Our approach allows postings to be prioritized according to the amount of information they contribute to the index, and by specifying a different divergence measure, the contribution is modeled on a different returns curve. In our experiments on GOV2 data, Rényi divergence of order infinity appears the most effective. This divergence measure significantly outperforms many standard methods and achieves retrieval effectiveness identical to that of the full index using only 50% of the postings. When top-k precision is the only concern, 10% of the data is sufficient to achieve the accuracy one would usually expect from a full index.
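The pruning strategy the abstract describes — score each posting by its information contribution and retain the highest-scoring ones up to a budget — can be sketched as follows. This is a minimal illustration, not the paper's algorithm: the surprisal-weighted score and the `prune_index` helper are assumptions standing in for the divergence-derived priority that the paper minimizes exactly.

```python
from collections import Counter
from math import log

def prune_index(postings, keep_fraction=0.5):
    """Keep the top `keep_fraction` of postings by contribution score.

    postings: list of (term, doc_id, tf) tuples.
    The score used here (tf weighted by the term's collection-level
    surprisal) is an illustrative stand-in, not the divergence-based
    objective derived in the paper.
    """
    # Collection statistics for the background language model.
    coll_tf = Counter()
    total = 0
    for term, _, tf in postings:
        coll_tf[term] += tf
        total += tf

    def score(posting):
        term, _, tf = posting
        p_coll = coll_tf[term] / total
        return tf * -log(p_coll)  # higher = more informative posting

    # Rank postings by contribution and keep the requested fraction.
    ranked = sorted(postings, key=score, reverse=True)
    k = int(len(ranked) * keep_fraction)
    return ranked[:k]

# Usage on a toy index: frequent stopword postings are pruned first.
toy = [("apple", 1, 3), ("the", 1, 20), ("banana", 2, 5), ("the", 2, 18)]
print(prune_index(toy, keep_fraction=0.5))
```

Under this kind of scheme, choosing a different divergence measure would change the scoring function, and hence the shape of the returns curve along which postings are admitted to the pruned index.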