Information retrieval evaluation with humans in the loop

G. Kazai
{"title":"Information retrieval evaluation with humans in the loop","authors":"G. Kazai","doi":"10.1145/2637002.2637003","DOIUrl":null,"url":null,"abstract":"The evaluation and tuning of information retrieval (IR) systems based on the Cranfield paradigm requires purpose built test collections, which include sets of human contributed relevance labels, indicating the relevance of search results to a set of user queries. Traditional methods of collecting relevance labels rely on a fixed group of hired expert judges, who are trained to interpret user queries as accurately as possible and label documents accordingly. Human judges and the obtained relevance labels thus provide a critical link within the Cranfield style IR evaluation framework, where disagreement among judges and the impact of variable judgment sets on the final outcome of an evaluation is a well studied issue. There is also reported evidence that experiment outcomes can be affected by changes to the judging guidelines or changes in the judge population. Recently, the growing volume and diversity of the topics and documents to be judged is driving the increased adoption of crowdsourcing methods in IR evaluation, offering a viable alternative that scales with modest costs. In this model, relevance judgments are distributed online over a large population of humans, a crowd, facilitated, for example, by a crowdsourcing platform, such as Amazon's Mechanical Turk or Clickworker. Such platforms allow millions of anonymous crowd workers to be hired temporarily for micro-payments to complete so-called human intelligence tasks (HITs), such as labeling images or documents. Studies have shown that workers come from diverse backgrounds, work in a variety of different environments, and have different motivations. For example, users may turn to crowdsourcing as a way to make a living, to serve an altruistic or social purpose or simply to fill their time. They may become loyal crowd workers on one or more platforms, or they may leave after their first couple of encounters. Clearly, such a model is in stark contrast to the highly controlled methods that characterize the work of trained judges. For example, in a micro-task based crowdsourcing setup, worker training is usually minimal or non-existent. Furthermore, it is widely reported that labels provided by crowd workers can vary in quality, leading to noisy labels. Crowdsourcing can also suffer from undesirable worker behaviour and practices, e.g., dishonest behaviour or lack of expertise, that result in low quality contributions. While a range of quality assurance and control techniques have now been developed to reduce noise during or after task completion, little is known about the workers themselves and possible relationships between workers' characteristics, behaviour and the quality of their work. In this talk, I will review the findings of recent research that examines and compares trained judges and crowd workers hired to complete relevance assessment tasks of varying difficulty. The investigations include a range of aspects from how HIT design, judging instructions, worker demographics and characteristics may impact work quality. The main focus of the talk will be on experiments aimed to uncover characteristics of the crowd by monitoring their behaviour during different relevance assessment tasks, and compare them to professional judges' behaviour on the same tasks. 
Throughout the talk I will highlight challenges of quality assurance and control in crowdsourcing and propose a possible direction for solving the issue without relying on gold standard data sets, which are expensive to create and have limited application.","PeriodicalId":447867,"journal":{"name":"Proceedings of the 5th Information Interaction in Context Symposium","volume":"10 1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 5th Information Interaction in Context Symposium","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2637002.2637003","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

Abstract

The evaluation and tuning of information retrieval (IR) systems based on the Cranfield paradigm requires purpose-built test collections, which include sets of human-contributed relevance labels indicating the relevance of search results to a set of user queries. Traditional methods of collecting relevance labels rely on a fixed group of hired expert judges, who are trained to interpret user queries as accurately as possible and to label documents accordingly. Human judges and the relevance labels they produce thus provide a critical link within the Cranfield-style IR evaluation framework, where disagreement among judges and the impact of variable judgment sets on the final outcome of an evaluation are well-studied issues. There is also reported evidence that experiment outcomes can be affected by changes to the judging guidelines or changes in the judge population.

Recently, the growing volume and diversity of the topics and documents to be judged have been driving the increased adoption of crowdsourcing methods in IR evaluation, offering a viable alternative that scales with modest costs. In this model, relevance judgments are distributed online over a large population of humans, a crowd, facilitated, for example, by a crowdsourcing platform such as Amazon's Mechanical Turk or Clickworker. Such platforms allow millions of anonymous crowd workers to be hired temporarily for micro-payments to complete so-called human intelligence tasks (HITs), such as labeling images or documents. Studies have shown that workers come from diverse backgrounds, work in a variety of different environments, and have different motivations. For example, workers may turn to crowdsourcing as a way to make a living, to serve an altruistic or social purpose, or simply to fill their time. They may become loyal crowd workers on one or more platforms, or they may leave after their first couple of encounters. Clearly, such a model is in stark contrast to the highly controlled methods that characterize the work of trained judges. For example, in a micro-task-based crowdsourcing setup, worker training is usually minimal or non-existent. Furthermore, it is widely reported that labels provided by crowd workers can vary in quality, leading to noisy labels. Crowdsourcing can also suffer from undesirable worker behaviour and practices, e.g., dishonest behaviour or lack of expertise, that result in low-quality contributions. While a range of quality assurance and control techniques have now been developed to reduce noise during or after task completion, little is known about the workers themselves and the possible relationships between workers' characteristics, their behaviour, and the quality of their work.

In this talk, I will review the findings of recent research that examines and compares trained judges and crowd workers hired to complete relevance assessment tasks of varying difficulty. The investigations cover a range of aspects, including how HIT design, judging instructions, and worker demographics and characteristics may affect work quality. The main focus of the talk will be on experiments aimed at uncovering characteristics of the crowd by monitoring workers' behaviour during different relevance assessment tasks and comparing it to professional judges' behaviour on the same tasks. Throughout the talk I will highlight challenges of quality assurance and control in crowdsourcing and propose a possible direction for solving the issue without relying on gold-standard data sets, which are expensive to create and have limited applicability.
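
As a minimal illustration of how such relevance labels feed Cranfield-style evaluation, the Python sketch below computes nDCG@k for a single query from a ranked list of graded labels. It is not part of the talk; the labels, grade scale, and ranking are hypothetical values chosen for the example.

import math

def dcg_at_k(relevances, k):
    # Discounted cumulative gain over the top-k graded relevance labels;
    # ranks are 0-based, so the discount at position i is log2(i + 2).
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    # nDCG@k: DCG of the system's ranking normalised by the ideal (sorted) ranking.
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical graded labels (0-3) assigned by judges to one query's results,
# listed in the order the system ranked them.
system_labels = [3, 2, 0, 1, 0, 2]
print(f"nDCG@5 = {ndcg_at_k(system_labels, 5):.3f}")

Because the metric is computed directly from the labels, any disagreement or noise in the judgments propagates straight into the system scores, which is one reason the judge population matters for the final outcome of an evaluation.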
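
The closing point, quality control without gold-standard data, is commonly approached by collecting redundant judgments and estimating worker reliability from agreement alone. The sketch below is a simplified, hypothetical illustration of that general idea (iteratively re-weighted majority voting, in the spirit of Dawid-Skene-style aggregation); it is not the direction proposed in the talk, and all worker and document identifiers are invented.

from collections import defaultdict

def aggregate_labels(judgments, iterations=10):
    # judgments: list of (worker_id, item_id, label) tuples.
    # Workers start with equal weight; each round, the consensus label per item
    # is the weight-majority vote, and each worker's weight becomes their
    # agreement rate with that consensus. A simplified sketch, not a full
    # Dawid-Skene implementation.
    workers = {w for w, _, _ in judgments}
    weights = {w: 1.0 for w in workers}
    consensus = {}
    for _ in range(iterations):
        votes = defaultdict(lambda: defaultdict(float))
        for w, item, label in judgments:
            votes[item][label] += weights[w]
        consensus = {item: max(tally, key=tally.get) for item, tally in votes.items()}
        agree, total = defaultdict(int), defaultdict(int)
        for w, item, label in judgments:
            total[w] += 1
            agree[w] += int(label == consensus[item])
        weights = {w: agree[w] / total[w] for w in workers}
    return consensus, weights

# Hypothetical redundant judgments: (worker, document, binary relevance label).
crowd_judgments = [
    ("w1", "d1", 1), ("w2", "d1", 1), ("w3", "d1", 0),
    ("w1", "d2", 0), ("w2", "d2", 0), ("w3", "d2", 1),
    ("w1", "d3", 1), ("w2", "d3", 1), ("w3", "d3", 1),
]
labels, reliability = aggregate_labels(crowd_judgments)
print(labels)       # consensus label per document
print(reliability)  # estimated agreement-based reliability per worker

Schemes of this kind avoid gold labels by relying on redundancy, so their cost scales with how many workers judge each item rather than with expert annotation.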