Meta-evaluation of Conversational Search Evaluation Metrics

ACM Transactions on Information Systems (TOIS) | Pub Date: 2021-04-27 | DOI: 10.1145/3445029

Zeyang Liu, K. Zhou, Max L. Wilson


Citations: 8

Abstract

Conversational search systems, such as Google Assistant and Microsoft Cortana, enable users to interact with search systems over multiple rounds of natural language dialogue. Evaluating such systems is very challenging, given that any natural language response could be generated and that users commonly interact for multiple semantically coherent rounds to accomplish a search task. Although prior studies have proposed many evaluation metrics, the extent to which those measures effectively capture user preference remains to be investigated. In this article, we systematically meta-evaluate a variety of conversational search metrics. We specifically study three perspectives on those metrics: (1) reliability: the ability to detect “actual” performance differences as opposed to those observed by chance; (2) fidelity: the ability to agree with ultimate user preference; and (3) intuitiveness: the ability to capture the properties deemed important in the context of conversational search, namely adequacy, informativeness, and fluency. By conducting experiments on two test collections, we find that the performance of different metrics varies significantly across scenarios and that, consistent with prior studies, existing metrics achieve only weak correlation with ultimate user preference and satisfaction. METEOR is, comparatively speaking, the best existing single-turn metric considering all three perspectives. We also demonstrate that adapted session-based evaluation metrics can be used to measure multi-turn conversational search, achieving moderate concordance with user satisfaction. To our knowledge, our work establishes the most comprehensive meta-evaluation for conversational search to date.
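As a rough illustration of the fidelity perspective (agreement between a metric and user preference), the sketch below computes the fraction of response pairs on which a metric orders the two responses the same way a user did. This is an assumption-laden toy, not the procedure used in the article: the function name `preference_agreement`, the tie-skipping rule, and the example scores are all hypothetical.

```python
from typing import Sequence, Tuple


def preference_agreement(
    metric_scores: Sequence[Tuple[float, float]],
    user_prefers_first: Sequence[bool],
) -> float:
    """Fraction of response pairs where the metric and the user prefer the same response.

    Each entry of metric_scores is (score of response A, score of response B).
    Pairs that the metric scores as exact ties are skipped (an assumption of this sketch).
    """
    if len(metric_scores) != len(user_prefers_first):
        raise ValueError("inputs must have the same length")

    agree = 0
    counted = 0
    for (score_a, score_b), prefers_a in zip(metric_scores, user_prefers_first):
        if score_a == score_b:
            continue  # the metric expresses no preference for this pair
        counted += 1
        if (score_a > score_b) == prefers_a:
            agree += 1

    if counted == 0:
        raise ValueError("no untied pairs to compare")
    return agree / counted


# Hypothetical toy data: four response pairs scored by some metric,
# and whether the user preferred response A in each pair.
scores = [(0.42, 0.31), (0.10, 0.55), (0.70, 0.70), (0.25, 0.20)]
prefs = [True, False, True, False]
print(preference_agreement(scores, prefs))
```

In this toy data the third pair is a tie and is skipped, and the metric agrees with the user on two of the remaining three pairs. A real study would more likely report a rank correlation or a significance-tested concordance rather than this raw fraction.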



