Cross-task crowdsourcing

Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining Pub Date : 2013-08-11 DOI:10.1145/2487575.2487593

Kaixiang Mo, Erheng Zhong, Qiang Yang

{"title":"Cross-task crowdsourcing","authors":"Kaixiang Mo, Erheng Zhong, Qiang Yang","doi":"10.1145/2487575.2487593","DOIUrl":null,"url":null,"abstract":"Crowdsourcing is an effective method for collecting labeled data for various data mining tasks. It is critical to ensure the veracity of the produced data because responses collected from different users may be noisy and unreliable. Previous works solve this veracity problem by estimating both the user ability and question difficulty based on the knowledge in each task individually. In this case, each single task needs large amounts of data to provide accurate estimations. However, in practice, budgets provided by customers for a given target task may be limited, and hence each question can be presented to only a few users where each user can answer only a few questions. This data sparsity problem can cause previous approaches to perform poorly due to the overfitting problem on rare data and eventually damage the data veracity. Fortunately, in real-world applications, users can answer questions from multiple historical tasks. For example, one can annotate images as well as label the sentiment of a given title. In this paper, we employ transfer learning, which borrows knowledge from auxiliary historical tasks to improve the data veracity in a given target task. The motivation is that users have stable characteristics across different crowdsourcing tasks and thus data from different tasks can be exploited collectively to estimate users' abilities in the target task. We propose a hierarchical Bayesian model, TLC (Transfer Learning for Crowdsourcing), to implement this idea by considering the overlapping users as a bridge. In addition, to avoid possible negative impact, TLC introduces task-specific factors to model task differences. The experimental results show that TLC significantly improves the accuracy over several state-of-the-art non-transfer-learning approaches under very limited budget in various labeling tasks.","PeriodicalId":20472,"journal":{"name":"Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"50 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2013-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"50","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2487575.2487593","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 50

Abstract

Crowdsourcing is an effective method for collecting labeled data for various data mining tasks. It is critical to ensure the veracity of the produced data because responses collected from different users may be noisy and unreliable. Previous works solve this veracity problem by estimating both the user ability and question difficulty based on the knowledge in each task individually. In this case, each single task needs large amounts of data to provide accurate estimations. However, in practice, budgets provided by customers for a given target task may be limited, and hence each question can be presented to only a few users where each user can answer only a few questions. This data sparsity problem can cause previous approaches to perform poorly due to the overfitting problem on rare data and eventually damage the data veracity. Fortunately, in real-world applications, users can answer questions from multiple historical tasks. For example, one can annotate images as well as label the sentiment of a given title. In this paper, we employ transfer learning, which borrows knowledge from auxiliary historical tasks to improve the data veracity in a given target task. The motivation is that users have stable characteristics across different crowdsourcing tasks and thus data from different tasks can be exploited collectively to estimate users' abilities in the target task. We propose a hierarchical Bayesian model, TLC (Transfer Learning for Crowdsourcing), to implement this idea by considering the overlapping users as a bridge. In addition, to avoid possible negative impact, TLC introduces task-specific factors to model task differences. The experimental results show that TLC significantly improves the accuracy over several state-of-the-art non-transfer-learning approaches under very limited budget in various labeling tasks.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Cross-task众包

众包是为各种数据挖掘任务收集标记数据的有效方法。确保生成数据的准确性至关重要，因为从不同用户收集的响应可能是嘈杂的和不可靠的。以前的工作是根据每个任务中的知识分别估计用户能力和问题难度来解决这个准确性问题。在这种情况下，每个任务都需要大量的数据来提供准确的估计。然而，在实践中，客户为给定的目标任务提供的预算可能是有限的，因此每个问题只能呈现给少数用户，而每个用户只能回答少数问题。这种数据稀疏性问题会导致以前的方法由于对稀有数据的过拟合问题而性能不佳，最终损害数据的准确性。幸运的是，在实际应用程序中，用户可以回答来自多个历史任务的问题。例如，可以对图像进行注释，也可以标记给定标题的情感。在本文中，我们采用迁移学习，从辅助历史任务中借鉴知识来提高给定目标任务中数据的准确性。其动机是用户在不同的众包任务中具有稳定的特征，从而可以集体利用来自不同任务的数据来估计用户在目标任务中的能力。我们提出了一种分层贝叶斯模型TLC (Transfer Learning for Crowdsourcing)，通过将重叠的用户视为桥梁来实现这一想法。此外，为了避免可能的负面影响，TLC引入了任务特定因素来模拟任务差异。实验结果表明，在预算非常有限的情况下，TLC在各种标记任务中显著提高了几种最先进的非迁移学习方法的准确性。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining

自引率

0.00%

发文量

期刊最新文献

A general bootstrap performance diagnostic Flexible and robust co-regularized multi-domain graph clustering Beyond myopic inference in big data pipelines Constrained stochastic gradient descent for large-scale least squares problem Inferring distant-time location in low-sampling-rate trajectories