Verboseness Fission for BM25 Document Length Normalization

Proceedings of the 2015 International Conference on The Theory of Information Retrieval
Pub Date: 2015-09-27
DOI: 10.1145/2808194.2809486

Aldo Lipani, M. Lupu, A. Hanbury, Akiko Aizawa

Citations: 18

Abstract

BM25 is probably the best-known term weighting model in Information Retrieval. Depending on the formula variant, it has 2 or 3 parameters (k1, b, and k3). This paper addresses b, the document length normalization parameter. We observe that the two cases previously discussed for length normalization (multi-topicality and verboseness) are actually three: multi-topicality, verboseness with word repetition (repetitiveness), and verboseness with synonyms. Based on this observation, we propose and test a new length normalization method that removes the need for a b parameter in BM25. Testing the new method on a set of purposefully varied test collections, we obtain results statistically indistinguishable from the optimal results, which removes the need for ground-truth-based optimization.
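For readers unfamiliar with the role of b, the following is a minimal sketch of the classic BM25 per-term score, not the normalization method proposed in the paper (the abstract does not give its formula). The function names, the default values k1=1.2 and b=0.75, and the choice of IDF variant are assumptions made for illustration; the query-side parameter k3 is omitted.

```python
import math


def length_norm(doc_len, avg_doc_len, b):
    """Classic BM25 length normalization factor.

    b = 0 disables length normalization; b = 1 normalizes fully
    by the ratio of document length to average document length.
    """
    return 1.0 - b + b * (doc_len / avg_doc_len)


def bm25_term_score(tf, doc_len, avg_doc_len, df, n_docs, k1=1.2, b=0.75):
    """Score of one query term in one document under classic BM25.

    tf: term frequency in the document
    df: number of documents containing the term
    n_docs: number of documents in the collection
    k1, b: free parameters (the defaults here are illustrative assumptions)
    """
    # One common IDF variant (assumption); the +1 keeps the value non-negative.
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
    norm = length_norm(doc_len, avg_doc_len, b)
    # Saturating term-frequency component, damped by the length factor.
    return idf * (tf * (k1 + 1.0)) / (tf + k1 * norm)


if __name__ == "__main__":
    # Same term statistics, two document lengths, three settings of b.
    for b in (0.0, 0.5, 1.0):
        short = bm25_term_score(3, 100, 100, 10, 1000, b=b)
        long_ = bm25_term_score(3, 500, 100, 10, 1000, b=b)
        print(f"b={b}: average-length doc {short:.4f}, long doc {long_:.4f}")
```

With b = 0 the two documents receive the same score; as b grows, the longer document is penalized more heavily. In the classic formulation, b is typically tuned per collection against relevance judgments, which is the optimization step the paper aims to remove.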
