Which Diversity Evaluation Measures Are "Good"?

T. Sakai, Zhaohao Zeng
{"title":"Which Diversity Evaluation Measures Are \"Good\"?","authors":"T. Sakai, Zhaohao Zeng","doi":"10.1145/3331184.3331215","DOIUrl":null,"url":null,"abstract":"This study evaluates 30 IR evaluation measures or their instances, of which nine are for adhoc IR and 21 are for diversified IR, primarily from the viewpoint of whether their preferences of one SERP (search engine result page) over another actually align with users' preferences. The gold preferences were contructed by hiring 15 assessors, who independently examined 1,127 SERP pairs and made preference assessments. Two sets of preference assessments were obtained: one based on a relevance question \"Which SERP is more relevant?'' and the other based on a diversity question \"Which SERP is likely to satisfy a higher number of users?'' To our knowledge, our study is the first to have collected diversity preference assessments in this way and evaluated diversity measures successfully. Our main results are that (a) Popular adhoc IR measures such as nDCG actually align quite well with the gold relevance preferences; and that (b) While the ♯-measures align well with the gold diversity preferences, intent-aware measures perform relatively poorly. Moreover, as by-products of our analysis of existing evaluation measures, we define new adhoc measures called iRBU (intentwise Rank-Biased Utility) and EBR (Expected Blended Ratio); we demonstrate that an instance of iRBU performs as well as nDCG when compared to the gold relevance preferences. On the other hand, the original RBU, a recently-proposed diversity measure, underperforms the best ♯-measures when compared to the gold diversity preferences.","PeriodicalId":20700,"journal":{"name":"Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2019-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"36","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3331184.3331215","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 36

Abstract

This study evaluates 30 IR evaluation measures or their instances, of which nine are for adhoc IR and 21 are for diversified IR, primarily from the viewpoint of whether their preferences of one SERP (search engine result page) over another actually align with users' preferences. The gold preferences were contructed by hiring 15 assessors, who independently examined 1,127 SERP pairs and made preference assessments. Two sets of preference assessments were obtained: one based on a relevance question "Which SERP is more relevant?'' and the other based on a diversity question "Which SERP is likely to satisfy a higher number of users?'' To our knowledge, our study is the first to have collected diversity preference assessments in this way and evaluated diversity measures successfully. Our main results are that (a) Popular adhoc IR measures such as nDCG actually align quite well with the gold relevance preferences; and that (b) While the ♯-measures align well with the gold diversity preferences, intent-aware measures perform relatively poorly. Moreover, as by-products of our analysis of existing evaluation measures, we define new adhoc measures called iRBU (intentwise Rank-Biased Utility) and EBR (Expected Blended Ratio); we demonstrate that an instance of iRBU performs as well as nDCG when compared to the gold relevance preferences. On the other hand, the original RBU, a recently-proposed diversity measure, underperforms the best ♯-measures when compared to the gold diversity preferences.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
哪些多样性评估措施是“好的”?
本研究评估了30个IR评估措施或其实例,其中9个用于特殊IR, 21个用于多样化IR,主要是从他们对一个SERP(搜索引擎结果页面)的偏好是否与另一个用户的偏好相一致的角度出发。黄金偏好由15名评估人员构建,他们独立检查了1,127对SERP并进行了偏好评估。获得了两组偏好评估:一组基于相关性问题“哪个SERP更相关?”,另一个则是基于一个多样性问题:“哪个SERP可能满足更多的用户?”“据我们所知,我们的研究是第一个以这种方式收集多样性偏好评估并成功评估多样性措施的研究。我们的主要结果是:(a)流行的临时IR指标,如nDCG,实际上与黄金相关性偏好相当一致;并且(b)虽然# -措施与黄金多样性偏好很好地一致,但意图意识措施表现相对较差。此外,作为我们对现有评价措施分析的副产品,我们定义了新的特别措施,称为iRBU(故意秩偏效用)和EBR(预期混合比率);我们证明,与黄金相关偏好相比,iRBU实例的表现与nDCG一样好。另一方面,与黄金多样性偏好相比,最初的RBU,最近提出的多样性指标,表现不如最佳# -指标。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Automatic Task Completion Flows from Web APIs Session details: Session 6A: Social Media Sequence and Time Aware Neighborhood for Session-based Recommendations: STAN Adversarial Training for Review-Based Recommendations Hate Speech Detection is Not as Easy as You May Think: A Closer Look at Model Validation
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1