Preference-based Evaluation Metrics for Web Image Search

Xiaohui Xie, Jiaxin Mao, Y. Liu, M. de Rijke, Haitian Chen, Min Zhang, Shaoping Ma
Published in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval
Publication date: 2020-07-25
DOI: 10.1145/3397271.3401146 (https://doi.org/10.1145/3397271.3401146)
Citations: 9

Abstract

Following the success of Cranfield-like evaluation approaches in web search, web image search has also been evaluated with absolute judgments of (graded) relevance. However, recent research has found that collecting absolute relevance judgments may be difficult in image search scenarios due to the multi-dimensional nature of relevance for image results. Moreover, existing evaluation metrics based on absolute relevance judgments do not correlate well with search users' satisfaction perceptions in web image search. Unlike absolute relevance judgments, preference judgments do not require that relevance grades be pre-defined, i.e., how many levels to use and what those levels mean. Instead of considering each document in isolation, preference judgments consider a pair of documents and require judges to state their relative preference. Such preference judgments are usually more reliable than absolute judgments since the presence of (at least) two items establishes a certain context. While preference judgments have been studied extensively for general web search, there exists no thorough investigation of how preference judgments and preference-based evaluation metrics can be used to evaluate web image search systems. Compared to general web search, web image search may be an even better fit for preference-based evaluation because of its grid-based presentation style. The limited need for fresh results in web image search also makes preference judgments more reusable than for general web search.

In this paper, we provide a thorough comparison of variants of preference judgments for web image search. We find that compared to strict preference judgments, weak preference judgments require less time and have better inter-assessor agreement. We also study how the absolute relevance levels of two given images affect preference judgments between them. Furthermore, we propose a preference-based evaluation metric named Preference-Winning-Penalty (PWP) to compare two image search systems. The proposed PWP metric outperforms existing evaluation metrics based on absolute relevance judgments in terms of agreement with the system-level preferences of actual users.
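
The abstract compares strict and weak preference judgments by inter-assessor agreement but does not say which agreement statistic is used. As a minimal sketch, assuming that weak preference judgments add a "tie" option to the left/right choice and that agreement is measured with Cohen's kappa (both are assumptions, not details taken from the abstract), agreement between two assessors could be computed as follows. The label names and the judgment data are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two assessors who labelled the same image pairs."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of image pairs on which both assessors gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each assessor's label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        # Both assessors used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical weak preference judgments ("tie" allowed) over six image pairs.
assessor_1 = ["left", "tie", "right", "left", "tie", "right"]
assessor_2 = ["left", "tie", "right", "right", "tie", "left"]

kappa = cohen_kappa(assessor_1, assessor_2)
print(f"kappa = {kappa:.2f}")
```

The same function applies to strict judgments by restricting the labels to "left" and "right", which allows the two judgment variants to be compared on the same set of image pairs.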