Dynamically monitoring crowd-worker's reliability with interval-valued labels

Chenyi Hu, Makenzie Spurling
{"title":"Dynamically monitoring crowd-worker's reliability with interval-valued labels","authors":"Chenyi Hu, Makenzie Spurling","doi":"10.54941/ahfe1003270","DOIUrl":null,"url":null,"abstract":"Crowdsourcing has rapidly become a computing paradigm in machine learning and artificial intelligence. In crowdsourcing, multiple labels are collected from crowd-workers on an instance usually through the Internet. These labels are then aggregated as a single label to match the ground truth of the instance. Due to its open nature, human workers in crowdsourcing usually come with various levels of knowledge and socio-economic backgrounds. Effectively handling such human factors has been a focus in the study and applications of crowdsourcing. For example, Bi et al studied the impacts of worker's dedication, expertise, judgment, and task difficulty (Bi et al 2014). Qiu et al offered methods for selecting workers based on behavior prediction (Qiu et al 2016). Barbosa and Chen suggested rehumanizing crowdsourcing to deal with human biases (Barbosa 2019). Checco et al studied adversarial attacks on crowdsourcing for quality control (Checco et al 2020). There are many more related works available in literature. In contrast to commonly used binary-valued labels, interval-valued labels (IVLs) have been introduced very recently (Hu et al 2021). Applying statistical and probabilistic properties of interval-valued datasets, Spurling et al quantitatively defined worker's reliability in four measures: correctness, confidence, stability, and predictability (Spurling et al 2021). Calculating these measures, except correctness, does not require the ground truth of each instance but only worker’s IVLs. Applying these quantified reliability measures, people have significantly improved the overall quality of crowdsourcing (Spurling et al 2022). However, in real world applications, the reliability of a worker may vary from time to time rather than a constant. 
It is necessary to monitor worker’s reliability dynamically. Because a worker j labels instances sequentially, we treat j’s IVLs as an interval-valued time series in our approach. Assuming j’s reliability relies on the IVLs within a time window only, we calculate j’s reliability measures with the IVLs in the current time window. Moving the time window forward with our proposed practical strategies, we can monitor j’s reliability dynamically. Furthermore, the four reliability measures derived from IVLs are time varying too. With regression analysis, we can separate each reliability measure as an explainable trend and possible errors. To validate our approaches, we use four real world benchmark datasets in our computational experiments. Here are the main findings. The reliability weighted interval majority voting (WIMV) and weighted preferred matching probability (WPMP) schemes consistently overperform the base schemes in terms of much higher accuracy, precision, recall, and F1-score. Note: the base schemes are majority voting (MV), interval majority voting (IMV), and preferred matching probability (PMP). Through monitoring worker’s reliability, our computational experiments have successfully identified possible attackers. By removing identified attackers, we have ensured the quality. We have also examined the impact of window size selection. 
It is necessary to monitor worker’s reliability dynamically, and our computational results evident the potential success of our approaches.This work is partially supported by the US National Science Foundation through the grant award NSF/OIA-1946391.","PeriodicalId":405313,"journal":{"name":"Artificial Intelligence and Social Computing","volume":"40 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Artificial Intelligence and Social Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.54941/ahfe1003270","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Crowdsourcing has rapidly become a computing paradigm in machine learning and artificial intelligence. In crowdsourcing, multiple labels for an instance are collected from crowd-workers, usually over the Internet. These labels are then aggregated into a single label intended to match the ground truth of the instance. Because of its open nature, crowdsourcing draws human workers with widely varying levels of knowledge and socio-economic backgrounds. Effectively handling such human factors has been a focus of the study and application of crowdsourcing. For example, Bi et al. studied the impacts of a worker's dedication, expertise, judgment, and task difficulty (Bi et al. 2014). Qiu et al. offered methods for selecting workers based on behavior prediction (Qiu et al. 2016). Barbosa and Chen suggested rehumanizing crowdsourcing to deal with human biases (Barbosa 2019). Checco et al. studied adversarial attacks on crowdsourcing for quality control (Checco et al. 2020). Many more related works are available in the literature. In contrast to the commonly used binary-valued labels, interval-valued labels (IVLs) were introduced only recently (Hu et al. 2021). Applying statistical and probabilistic properties of interval-valued datasets, Spurling et al. quantitatively defined a worker's reliability with four measures: correctness, confidence, stability, and predictability (Spurling et al. 2021). Calculating these measures, except correctness, does not require the ground truth of each instance, only the worker's IVLs. Applying these quantified reliability measures has significantly improved the overall quality of crowdsourcing (Spurling et al. 2022). However, in real-world applications, a worker's reliability may vary over time rather than remain constant. It is therefore necessary to monitor a worker's reliability dynamically. Because a worker j labels instances sequentially, our approach treats j's IVLs as an interval-valued time series.
Assuming that j's reliability depends only on the IVLs within a time window, we calculate j's reliability measures from the IVLs in the current window. Moving the window forward with our proposed practical strategies, we can monitor j's reliability dynamically. Furthermore, the four reliability measures derived from IVLs are themselves time-varying. With regression analysis, we can decompose each reliability measure into an explainable trend and possible errors. To validate our approaches, we ran computational experiments on four real-world benchmark datasets. The main findings are as follows. The reliability-weighted interval majority voting (WIMV) and weighted preferred matching probability (WPMP) schemes consistently outperform the base schemes with much higher accuracy, precision, recall, and F1-score; the base schemes are majority voting (MV), interval majority voting (IMV), and preferred matching probability (PMP). By monitoring worker reliability, our computational experiments successfully identified possible attackers, and removing the identified attackers preserved overall quality. We also examined the impact of window-size selection. Monitoring worker reliability dynamically is necessary, and our computational results demonstrate the potential of our approaches. This work is partially supported by the US National Science Foundation through grant award NSF/OIA-1946391.