Meta-evaluation of Conversational Search Evaluation Metrics

Zeyang Liu, K. Zhou, Max L. Wilson
{"title":"Meta-evaluation of Conversational Search Evaluation Metrics","authors":"Zeyang Liu, K. Zhou, Max L. Wilson","doi":"10.1145/3445029","DOIUrl":null,"url":null,"abstract":"Conversational search systems, such as Google assistant and Microsoft Cortana, enable users to interact with search systems in multiple rounds through natural language dialogues. Evaluating such systems is very challenging, given that any natural language responses could be generated, and users commonly interact for multiple semantically coherent rounds to accomplish a search task. Although prior studies proposed many evaluation metrics, the extent of how those measures effectively capture user preference remain to be investigated. In this article, we systematically meta-evaluate a variety of conversational search metrics. We specifically study three perspectives on those metrics: (1) reliability: the ability to detect “actual” performance differences as opposed to those observed by chance; (2) fidelity: the ability to agree with ultimate user preference; and (3) intuitiveness: the ability to capture any property deemed important: adequacy, informativeness, and fluency in the context of conversational search. By conducting experiments on two test collections, we find that the performance of different metrics vary significantly across different scenarios, whereas consistent with prior studies, existing metrics only achieve weak correlation with ultimate user preference and satisfaction. METEOR is, comparatively speaking, the best existing single-turn metric considering all three perspectives. We also demonstrate that adapted session-based evaluation metrics can be used to measure multi-turn conversational search, achieving moderate concordance with user satisfaction. To our knowledge, our work establishes the most comprehensive meta-evaluation for conversational search to date.","PeriodicalId":6934,"journal":{"name":"ACM Transactions on Information Systems (TOIS)","volume":"1 1","pages":"1 - 42"},"PeriodicalIF":0.0000,"publicationDate":"2021-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"8","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Information Systems (TOIS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3445029","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 8

Abstract

Conversational search systems, such as Google Assistant and Microsoft Cortana, enable users to interact with search systems over multiple rounds of natural language dialogue. Evaluating such systems is very challenging, given that arbitrary natural language responses can be generated, and users commonly interact for multiple semantically coherent rounds to accomplish a search task. Although prior studies have proposed many evaluation metrics, the extent to which those measures effectively capture user preference remains to be investigated. In this article, we systematically meta-evaluate a variety of conversational search metrics. We specifically study three perspectives on those metrics: (1) reliability: the ability to detect “actual” performance differences as opposed to those observed by chance; (2) fidelity: the ability to agree with ultimate user preference; and (3) intuitiveness: the ability to capture properties deemed important in the context of conversational search: adequacy, informativeness, and fluency. By conducting experiments on two test collections, we find that the performance of different metrics varies significantly across scenarios, while, consistent with prior studies, existing metrics achieve only weak correlation with ultimate user preference and satisfaction. Considering all three perspectives, METEOR is, comparatively speaking, the best existing single-turn metric. We also demonstrate that adapted session-based evaluation metrics can be used to measure multi-turn conversational search, achieving moderate concordance with user satisfaction. To our knowledge, our work establishes the most comprehensive meta-evaluation for conversational search to date.
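
The fidelity perspective above reduces to measuring rank agreement between automatic metric scores and user preference judgements. The minimal Python sketch below (all scores and ratings are invented placeholders, not the paper's data) shows how such concordance could be computed with Kendall's tau and Spearman's rho from SciPy:

    # Hedged sketch of a fidelity-style check: correlate automatic
    # metric scores with user preference ratings for the same responses.
    # All numbers below are hypothetical placeholders.
    from scipy.stats import kendalltau, spearmanr

    # Hypothetical per-response scores from a single-turn metric (e.g., METEOR)
    metric_scores = [0.41, 0.18, 0.66, 0.30, 0.52]
    # Hypothetical user preference ratings for the same system responses
    user_ratings = [4, 2, 5, 2, 3]

    tau, tau_p = kendalltau(metric_scores, user_ratings)
    rho, rho_p = spearmanr(metric_scores, user_ratings)
    print(f"Kendall tau = {tau:.3f} (p = {tau_p:.3f})")
    print(f"Spearman rho = {rho:.3f} (p = {rho_p:.3f})")

A strong positive correlation would indicate that the metric ranks responses the way users do; the abstract's finding is that existing single-turn metrics reach only weak correlations of this kind.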