A New Perspective on Score Standardization

Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval
Pub Date: 2019-07-18
DOI: 10.1145/3331184.3331315

Julián Urbano, Harlley Lima, A. Hanjalic

Citations: 10

Abstract

In test-collection-based evaluation of IR systems, score standardization has been proposed to compare systems across collections and to minimize the effect of outlier runs on specific topics. The underlying idea is to account for the difficulty of each topic, so that systems are scored relative to it. Webber et al. first proposed standardization through a non-linear transformation with the standard normal distribution, and Sakai recently proposed a simple linear transformation. In this paper, we show that both approaches are special cases of a simple standardization that assumes specific distributions for the per-topic scores. From this viewpoint, we argue that a transformation based on the empirical distribution is the most appropriate choice for this kind of standardization. Through a series of experiments on TREC data, we show the benefits of our proposal in terms of score stability and statistical test behavior.
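The three transformations mentioned in the abstract can be illustrated with a minimal sketch, assuming each one is applied to the scores that a set of systems obtained on a single topic. The function names, the use of the sample standard deviation, the clipping to [0, 1], and the default constants `a=0.15` and `b=0.5` are illustrative assumptions, not details given in the text above.

```python
import math
from statistics import mean, stdev


def z_scores(scores):
    """Center and scale the per-topic scores (assumes they are not all equal)."""
    mu = mean(scores)
    sd = stdev(scores)  # sample standard deviation; an assumption of this sketch
    return [(x - mu) / sd for x in scores]


def normal_cdf_standardize(scores):
    """Non-linear variant: map each z-score through the standard normal CDF."""
    return [0.5 * (1.0 + math.erf(z / math.sqrt(2.0))) for z in z_scores(scores)]


def linear_standardize(scores, a=0.15, b=0.5):
    """Linear variant: a * z + b, clipped to [0, 1].

    The defaults for a and b are hypothetical values chosen for illustration.
    """
    return [min(1.0, max(0.0, a * z + b)) for z in z_scores(scores)]


def empirical_standardize(scores):
    """Empirical variant: fraction of observed scores less than or equal to x."""
    n = len(scores)
    return [sum(1 for y in scores if y <= x) / n for x in scores]


if __name__ == "__main__":
    # Hypothetical scores of five systems on one topic.
    topic_scores = [0.1, 0.2, 0.3, 0.4, 0.5]
    print(normal_cdf_standardize(topic_scores))
    print(linear_standardize(topic_scores))
    print(empirical_standardize(topic_scores))
```

All three functions preserve the ranking of systems within a topic. They differ only in the distribution they implicitly assume for the per-topic scores, which is the viewpoint the abstract takes.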
