Micro-task crowdsourcing has become a popular approach to effectively tackle complex data management problems such as data linkage, missing values, or schema matching. However, the backend crowdsourced operators of crowd-powered systems typically yield higher latencies than the machine-processable operators, mainly due to inherent efficiency differences between humans and machines. This problem can be further exacerbated by a lack of workers on the target crowdsourcing platform, or when the workers are shared unequally among a number of competing requesters, including concurrent users from the same organization who execute crowdsourced queries with different types, priorities, and prices. Under such conditions, a crowd-powered system acts mostly as a proxy to the crowdsourcing platform, and hence it is very difficult to provide efficiency guarantees to its end-users. Scheduling is the traditional way of tackling such problems in computer science, by prioritizing access to shared resources. In this paper, we propose a new crowdsourcing system architecture that leverages scheduling algorithms to optimize task execution in a shared-resources environment, in this case a crowdsourcing platform. Our study aims at assessing the efficiency of the crowd in settings where multiple types of tasks are run concurrently. We present extensive experimental results comparing i) different multi-tenant crowdsourcing jobs, including a workload derived from real traces, and ii) different scheduling techniques tested with real crowd workers. Our experimental results show that task scheduling can be leveraged to achieve fairness and reduce query latency in multi-tenant crowd-powered systems, although with very different tradeoffs compared to traditional settings that do not include human factors.
{"title":"Scheduling Human Intelligence Tasks in Multi-Tenant Crowd-Powered Systems","authors":"D. Difallah, Gianluca Demartini, P. Cudré-Mauroux","doi":"10.1145/2872427.2883030","DOIUrl":"https://doi.org/10.1145/2872427.2883030","url":null,"abstract":"Micro-task crowdsourcing has become a popular approach to effectively tackle complex data management problems such as data linkage, missing values, or schema matching. However, the backend crowdsourced operators of crowd-powered systems typically yield higher latencies than the machine-processable operators, this is mainly due to inherent efficiency differences between humans and machines. This problem can be further exacerbated by the lack of workers on the target crowdsourcing platform, or when the workers are shared unequally among a number of competing requesters; including the concurrent users from the same organization who execute crowdsourced queries with different types, priorities and prices. Under such conditions, a crowd-powered system acts mostly as a proxy to the crowdsourcing platform, and hence it is very difficult to provide effiency guarantees to its end-users. Scheduling is the traditional way of tackling such problems in computer science, by prioritizing access to shared resources. In this paper, we propose a new crowdsourcing system architecture that leverages scheduling algorithms to optimize task execution in a shared resources environment, in this case a crowdsourcing platform. Our study aims at assessing the efficiency of the crowd in settings where multiple types of tasks are run concurrently. We present extensive experimental results comparing i) different multi-tenant crowdsourcing jobs, including a workload derived from real traces, and ii) different scheduling techniques tested with real crowd workers. Our experimental results show that task scheduling can be leveraged to achieve fairness and reduce query latency in multi-tenant crowd-powered systems, although with very different tradeoffs compared to traditional settings not including human factors.","PeriodicalId":20455,"journal":{"name":"Proceedings of the 25th International Conference on World Wide Web","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87595280","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ming Yin, Mary L. Gray, Siddharth Suri, Jennifer Wortman Vaughan
Since its inception, crowdsourcing has been considered a black-box approach to solicit labor from a crowd of workers. Furthermore, the "crowd" has been viewed as a group of independent workers dispersed all over the world. Recent studies based on in-person interviews have opened up the black box and shown that the crowd is not a collection of independent workers, but instead that workers communicate and collaborate with each other. Put another way, prior work has shown the existence of edges between workers. We build on and extend this discovery by mapping the entire communication network of workers on Amazon Mechanical Turk, a leading crowdsourcing platform. We execute a task in which over 10,000 workers from across the globe self-report their communication links to other workers, thereby mapping the communication network among workers. Our results suggest that while a large percentage of workers indeed appear to be independent, there is a rich network topology over the rest of the population. That is, there is a substantial communication network within the crowd. We further examine how online forum usage relates to network topology, how workers communicate with each other via this network, how workers' experience levels relate to their network positions, and how U.S. workers differ from international workers in their network characteristics. We conclude by discussing the implications of our findings for requesters, workers, and platform providers like Amazon.
{"title":"The Communication Network Within the Crowd","authors":"Ming Yin, Mary L. Gray, Siddharth Suri, Jennifer Wortman Vaughan","doi":"10.1145/2872427.2883036","DOIUrl":"https://doi.org/10.1145/2872427.2883036","url":null,"abstract":"Since its inception, crowdsourcing has been considered a black-box approach to solicit labor from a crowd of workers. Furthermore, the \"crowd\" has been viewed as a group of independent workers dispersed all over the world. Recent studies based on in-person interviews have opened up the black box and shown that the crowd is not a collection of independent workers, but instead that workers communicate and collaborate with each other. Put another way, prior work has shown the existence of edges between workers. We build on and extend this discovery by mapping the entire communication network of workers on Amazon Mechanical Turk, a leading crowdsourcing platform. We execute a task in which over 10,000 workers from across the globe self-report their communication links to other workers, thereby mapping the communication network among workers. Our results suggest that while a large percentage of workers indeed appear to be independent, there is a rich network topology over the rest of the population. That is, there is a substantial communication network within the crowd. We further examine how online forum usage relates to network topology, how workers communicate with each other via this network, how workers' experience levels relate to their network positions, and how U.S. workers differ from international workers in their network characteristics. We conclude by discussing the implications of our findings for requesters, workers, and platform providers like Amazon.","PeriodicalId":20455,"journal":{"name":"Proceedings of the 25th International Conference on World Wide Web","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88499674","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dhanya R. Krishnan, D. Quoc, Pramod Bhatotia, C. Fetzer, R. Rodrigues
Incremental and approximate computations are increasingly being adopted for data analytics to achieve low-latency execution and efficient utilization of computing resources. Incremental computation updates the output incrementally instead of re-computing everything from scratch for successive runs of a job with input changes. Approximate computation returns an approximate output for a job instead of the exact output. Both paradigms rely on computing over a subset of data items instead of computing over the entire dataset, but they differ in their means for skipping parts of the computation. Incremental computing relies on the memoization of intermediate results of sub-computations, and reusing these memoized results across jobs. Approximate computing relies on representative sampling of the entire dataset to compute over a subset of data items. In this paper, we observe that these two paradigms are complementary, and can be married together! Our idea is quite simple: design a sampling algorithm that biases the sample selection to the memoized data items from previous runs. To realize this idea, we designed an online stratified sampling algorithm that uses self-adjusting computation to produce an incrementally updated approximate output with bounded error. We implemented our algorithm in a data analytics system called IncApprox based on Apache Spark Streaming. Our evaluation using micro-benchmarks and real-world case-studies shows that IncApprox achieves the benefits of both incremental and approximate computing.
{"title":"IncApprox: A Data Analytics System for Incremental Approximate Computing","authors":"Dhanya R. Krishnan, D. Quoc, Pramod Bhatotia, C. Fetzer, R. Rodrigues","doi":"10.1145/2872427.2883026","DOIUrl":"https://doi.org/10.1145/2872427.2883026","url":null,"abstract":"Incremental and approximate computations are increasingly being adopted for data analytics to achieve low-latency execution and efficient utilization of computing resources. Incremental computation updates the output incrementally instead of re-computing everything from scratch for successive runs of a job with input changes. Approximate computation returns an approximate output for a job instead of the exact output. Both paradigms rely on computing over a subset of data items instead of computing over the entire dataset, but they differ in their means for skipping parts of the computation. Incremental computing relies on the memoization of intermediate results of sub-computations, and reusing these memoized results across jobs. Approximate computing relies on representative sampling of the entire dataset to compute over a subset of data items. In this paper, we observe that these two paradigms are complementary, and can be married together! Our idea is quite simple: design a sampling algorithm that biases the sample selection to the memoized data items from previous runs. To realize this idea, we designed an online stratified sampling algorithm that uses self-adjusting computation to produce an incrementally updated approximate output with bounded error. We implemented our algorithm in a data analytics system called IncApprox based on Apache Spark Streaming. Our evaluation using micro-benchmarks and real-world case-studies shows that IncApprox achieves the benefits of both incremental and approximate computing.","PeriodicalId":20455,"journal":{"name":"Proceedings of the 25th International Conference on World Wide Web","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89833577","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hong-Han Shuai, Chih-Ya Shen, De-Nian Yang, Yi-Feng Lan, Wang-Chien Lee, Philip S. Yu, Ming-Syan Chen
An increasing number of social network mental disorders (SNMDs), such as Cyber-Relationship Addiction, Information Overload, and Net Compulsion, have recently been noted. Symptoms of these mental disorders are usually observed passively today, resulting in delayed clinical intervention. In this paper, we argue that mining online social behavior provides an opportunity to actively identify SNMDs at an early stage. It is challenging to detect SNMDs because the mental factors considered in standard diagnostic criteria (questionnaires) cannot be observed from online social activity logs. Our approach, new to the practice of SNMD detection, does not rely on users self-reporting those mental factors via questionnaires. Instead, we propose a machine learning framework, namely Social Network Mental Disorder Detection (SNMDD), that exploits features extracted from social network data to accurately identify potential cases of SNMDs. We also exploit multi-source learning in SNMDD and propose a new SNMD-based Tensor Model (STM) to improve the performance. Our framework is evaluated via a user study with 3126 online social network users. We conduct a feature analysis, and also apply SNMDD on large-scale datasets and analyze the characteristics of the three SNMD types. The results show that SNMDD is promising for identifying online social network users with potential SNMDs.
{"title":"Mining Online Social Data for Detecting Social Network Mental Disorders","authors":"Hong-Han Shuai, Chih-Ya Shen, De-Nian Yang, Yi-Feng Lan, Wang-Chien Lee, Philip S. Yu, Ming-Syan Chen","doi":"10.1145/2872427.2882996","DOIUrl":"https://doi.org/10.1145/2872427.2882996","url":null,"abstract":"An increasing number of social network mental disorders (SNMDs), such as Cyber-Relationship Addiction, Information Overload, and Net Compulsion, have been recently noted. Symptoms of these mental disorders are usually observed passively today, resulting in delayed clinical intervention. In this paper, we argue that mining online social behavior provides an opportunity to actively identify SNMDs at an early stage. It is challenging to detect SNMDs because the mental factors considered in standard diagnostic criteria (questionnaire) cannot be observed from online social activity logs. Our approach, new and innovative to the practice of SNMD detection, does not rely on self-revealing of those mental factors via questionnaires. Instead, we propose a machine learning framework, namely, Social Network Mental Disorder Detection (SNMDD), that exploits features extracted from social network data to accurately identify potential cases of SNMDs. We also exploit multi-source learning in SNMDD and propose a new SNMDbased Tensor Model (STM) to improve the performance. Our framework is evaluated via a user study with 3126 online social network users. We conduct a feature analysis, and also apply SNMDD on large-scale datasets and analyze the characteristics of the three SNMD types. The results show that SNMDD is promising for identifying online social network users with potential SNMDs.","PeriodicalId":20455,"journal":{"name":"Proceedings of the 25th International Conference on World Wide Web","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90478858","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Qingbo Hu, Sihong Xie, Jiawei Zhang, Qiang Zhu, Songtao Guo, Philip S. Yu
Nowadays, a modern e-commerce company may have both online sales and offline sales departments. Normally, online sales attempt to sell in small quantities to individual customers by broadcasting a large number of emails or promotion codes, and rely heavily on backend algorithms. Offline sales, on the other hand, try to sell in much larger quantities to enterprise customers through contacts initiated by sales representatives, which are more costly compared to online sales. Unlike many previous research works focusing on machine learning algorithms to support online sales, this paper introduces an approach that utilizes heterogeneous social networks to improve the effectiveness of offline sales. More specifically, we propose a two-phase framework, HeteroSales, which first constructs a company-to-company graph, a.k.a. Company Homophily Graph (CHG), from semantics-based meta-path learning, and then adopts label propagation on the graph to predict promising companies with which we may successfully close an offline deal. Based on statistical analysis of the world's largest professional social network, LinkedIn, we present interesting discoveries showing that not all social connections in a heterogeneous social network are useful for this task. In other words, proper data preprocessing is essential to ensure the effectiveness of offline sales. Finally, through experiments on LinkedIn social network data and third-party offline sales records, we demonstrate the power of HeteroSales to identify potential enterprise customers in offline sales.
{"title":"HeteroSales: Utilizing Heterogeneous Social Networks to Identify the Next Enterprise Customer","authors":"Qingbo Hu, Sihong Xie, Jiawei Zhang, Qiang Zhu, Songtao Guo, Philip S. Yu","doi":"10.1145/2872427.2883000","DOIUrl":"https://doi.org/10.1145/2872427.2883000","url":null,"abstract":"Nowadays, a modern e-commerce company may have both online sales and offline sales departments. Normally, online sales attempt to sell in small quantities to individual customers through broadcasting a large amount of emails or promotion codes, which heavily rely on the designed backend algorithms. Offline sales, on the other hand, try to sell in much larger quantities to enterprise customers through contacts initiated by sales representatives, which are more costly compared to online sales. Unlike many previous research works focusing on machine learning algorithms to support online sales, this paper introduces an approach that utilizes heterogenous social networks to improve the effectiveness of offline sales. More specifically, we propose a two-phase framework, HeteroSales, which first constructs a company-to-company graph, a.k.a. Company Homophily Graph (CHG), from semantics based meta-path learning, and then adopts label propagation on the graph to predict promising companies that we may successfully close an offline deal with. Based on the statistical analysis on the world's largest professional social network, LinkedIn, we demonstrate interesting discoveries showing that not all the social connections in a heterogeneous social network are useful in this task. In other words, some proper data preprocessing is essential to ensure the effectiveness of offline sales. Finally, through the experiments on LinkedIn social network data and third-party offline sales records, we demonstrate the power of HereroSales to identify potential enterprise customers in offline sales.","PeriodicalId":20455,"journal":{"name":"Proceedings of the 25th International Conference on World Wide Web","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76183808","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dot Everyone!!: Power, the Internet and You","authors":"M. Fox","doi":"10.1145/2872427.2874817","DOIUrl":"https://doi.org/10.1145/2872427.2874817","url":null,"abstract":"","PeriodicalId":20455,"journal":{"name":"Proceedings of the 25th International Conference on World Wide Web","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79046418","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Besides simple human intelligence tasks such as image labeling, crowdsourcing platforms offer more and more tasks that require very specific skills, especially in participatory science projects. In this context, there is a need to reason about the skills required for a task and the set of skills available in the crowd, in order to increase the resulting quality. Most existing solutions rely on unstructured tags to model skills (a vector of skills). In this paper we propose to finely model tasks and participants using a skill tree, that is, a taxonomy of skills equipped with a similarity distance between skills. This skill model makes it possible to map participants to tasks in a way that exploits the natural hierarchy among the skills. We illustrate the effectiveness of our model and algorithms through extensive experimentation with synthetic and real data sets.
{"title":"Using Hierarchical Skills for Optimized Task Assignment in Knowledge-Intensive Crowdsourcing","authors":"Panagiotis Mavridis, D. Gross-Amblard, Z. Miklós","doi":"10.1145/2872427.2883070","DOIUrl":"https://doi.org/10.1145/2872427.2883070","url":null,"abstract":"Besides the simple human intelligence tasks such as image labeling, crowdsourcing platforms propose more and more tasks that require very specific skills, especially in participative science projects. In this context, there is a need to reason about the required skills for a task and the set of available skills in the crowd, in order to increase the resulting quality. Most of the existing solutions rely on unstructured tags to model skills (vector of skills). In this paper we propose to finely model tasks and participants using a skill tree, that is a taxonomy of skills equipped with a similarity distance within skills. This model of skills enables to map participants to tasks in a way that exploits the natural hierarchy among the skills. We illustrate the effectiveness of our model and algorithms through extensive experimentation with synthetic and real data sets.","PeriodicalId":20455,"journal":{"name":"Proceedings of the 25th International Conference on World Wide Web","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75514088","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mobile digital assistants such as Microsoft Cortana and Google Now currently offer appealing proactive experiences to users, which aim to deliver the right information at the right time. To achieve this goal, it is crucial to precisely predict users' real-time intent. Intent is closely related to context, which includes not only spatio-temporal information but also users' current activities that can be sensed by mobile devices. The relationship between intent and context is highly dynamic and exhibits chaotic sequential correlation. The context itself is often sparse and heterogeneous. The dynamics and co-movement among contextual signals are also elusive and complicated. Traditional recommendation models cannot be directly applied to proactive experiences because they fail to tackle the above challenges. Inspired by the nowcasting practice in meteorology and macroeconomics, we propose an innovative collaborative nowcasting model to effectively resolve these challenges. The proposed model successfully addresses the sparsity and heterogeneity of contextual signals. It also effectively models the convoluted correlation within contextual signals and between context and intent. Specifically, the model first extracts collaborative latent factors, which summarize shared temporal structural patterns in contextual signals, and then exploits the collaborative Kalman Filter to generate serially correlated personalized latent factors, which are utilized to monitor each user's real-time intent. Extensive experiments with real-world data sets from a commercial digital assistant demonstrate the effectiveness of the collaborative nowcasting model. The studied problem and model provide inspiring implications for new paradigms of recommendation on mobile intelligent devices.
{"title":"Collaborative Nowcasting for Contextual Recommendation","authors":"Yu Sun, Nicholas Jing Yuan, Xing Xie, Kieran McDonald, Rui Zhang","doi":"10.1145/2872427.2874812","DOIUrl":"https://doi.org/10.1145/2872427.2874812","url":null,"abstract":"Mobile digital assistants such as Microsoft Cortana and Google Now currently offer appealing proactive experiences to users, which aim to deliver the right information at the right time. To achieve this goal, it is crucial to precisely predict users' real-time intent. Intent is closely related to context, which includes not only the spatial-temporal information but also users' current activities that can be sensed by mobile devices. The relationship between intent and context is highly dynamic and exhibits chaotic sequential correlation. The context itself is often sparse and heterogeneous. The dynamics and co-movement among contextual signals are also elusive and complicated. Traditional recommendation models cannot directly apply to proactive experiences because they fail to tackle the above challenges. Inspired by the nowcasting practice in meteorology and macroeconomics, we propose an innovative collaborative nowcasting model to effectively resolve these challenges. The proposed model successfully addresses sparsity and heterogeneity of contextual signals. It also effectively models the convoluted correlation within contextual signals and between context and intent. Specifically, the model first extracts collaborative latent factors, which summarize shared temporal structural patterns in contextual signals, and then exploits the collaborative Kalman Filter to generate serially correlated personalized latent factors, which are utilized to monitor each user's real-time intent. Extensive experiments with real-world data sets from a commercial digital assistant demonstrate the effectiveness of the collaborative nowcasting model. The studied problem and model provide inspiring implications for new paradigms of recommendations on mobile intelligent devices.","PeriodicalId":20455,"journal":{"name":"Proceedings of the 25th International Conference on World Wide Web","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83775032","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Wikipedia is a major source of information for many people. However, false information on Wikipedia raises concerns about its credibility. One way in which false information may be presented on Wikipedia is in the form of hoax articles, i.e., articles containing fabricated facts about nonexistent entities or events. In this paper we study false information on Wikipedia by focusing on the hoax articles that have been created throughout its history. We make several contributions. First, we assess the real-world impact of hoax articles by measuring how long they survive before being debunked, how many pageviews they receive, and how heavily they are referred to by documents on the Web. We find that, while most hoaxes are detected quickly and have little impact on Wikipedia, a small number of hoaxes survive for a long time and are well cited across the Web. Second, we characterize the nature of successful hoaxes by comparing them to legitimate articles and to failed hoaxes that were discovered shortly after being created. We find characteristic differences in terms of article structure and content, embeddedness into the rest of Wikipedia, and features of the editor who created the hoax. Third, we successfully apply our findings to address a series of classification tasks, most notably to determine whether a given article is a hoax. And finally, we describe and evaluate a task in which humans distinguish hoaxes from non-hoaxes. We find that humans are not particularly good at the task and that our automated classifier outperforms them by a large margin.
{"title":"Disinformation on the Web: Impact, Characteristics, and Detection of Wikipedia Hoaxes","authors":"Srijan Kumar, Robert West, J. Leskovec","doi":"10.1145/2872427.2883085","DOIUrl":"https://doi.org/10.1145/2872427.2883085","url":null,"abstract":"Wikipedia is a major source of information for many people. However, false information on Wikipedia raises concerns about its credibility. One way in which false information may be presented on Wikipedia is in the form of hoax articles, i.e., articles containing fabricated facts about nonexistent entities or events. In this paper we study false information on Wikipedia by focusing on the hoax articles that have been created throughout its history. We make several contributions. First, we assess the real-world impact of hoax articles by measuring how long they survive before being debunked, how many pageviews they receive, and how heavily they are referred to by documents on the Web. We find that, while most hoaxes are detected quickly and have little impact on Wikipedia, a small number of hoaxes survive long and are well cited across the Web. Second, we characterize the nature of successful hoaxes by comparing them to legitimate articles and to failed hoaxes that were discovered shortly after being created. We find characteristic differences in terms of article structure and content, embeddedness into the rest of Wikipedia, and features of the editor who created the hoax. Third, we successfully apply our findings to address a series of classification tasks, most notably to determine whether a given article is a hoax. And finally, we describe and evaluate a task involving humans distinguishing hoaxes from non-hoaxes. We find that humans are not particularly good at the task and that our automated classifier outperforms them by a big margin.","PeriodicalId":20455,"journal":{"name":"Proceedings of the 25th International Conference on World Wide Web","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84563156","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yeye He, K. Chakrabarti, Tao Cheng, Tomasz Tylenda
Attribute synonyms are important ingredients for keyword-based search systems. For instance, web search engines recognize queries that seek the value of an entity on a specific attribute (referred to as e+a queries) and provide direct answers for them using a combination of knowledge bases, web tables and documents. However, users often refer to an attribute in their e+a query differently from how it is referred to in the web table or text passage. In such cases, search engines may fail to return relevant answers. To address that problem, we propose to automatically discover all the alternate ways of referring to the attributes of a given class of entities (referred to as attribute synonyms) in order to improve search quality. The state-of-the-art approach that relies on attribute name co-occurrence in web tables suffers from low precision. Our main insight is to combine positive evidence of attribute synonymity from query click logs with negative evidence from web table attribute name co-occurrences. We formalize the problem as an optimization problem on a graph, with the attribute names as the vertices and the positive and negative evidence from query logs and web table schemas as weighted edges. We develop a linear-programming-based algorithm to solve the problem with bi-criteria approximation guarantees. Our experiments on real-life datasets show that our approach has significantly higher precision and recall compared with the state-of-the-art.
{"title":"Automatic Discovery of Attribute Synonyms Using Query Logs and Table Corpora","authors":"Yeye He, K. Chakrabarti, Tao Cheng, Tomasz Tylenda","doi":"10.1145/2872427.2874816","DOIUrl":"https://doi.org/10.1145/2872427.2874816","url":null,"abstract":"Attribute synonyms are important ingredients for keyword-based search systems. For instance, web search engines recognize queries that seek the value of an entity on a specific attribute (referred to as e+a queries) and provide direct answers for them using a combination of knowledge bases, web tables and documents. However, users often refer to an attribute in their e+a query differently from how it is referred in the web table or text passage. In such cases, search engines may fail to return relevant answers. To address that problem, we propose to automatically discover all the alternate ways of referring to the attributes of a given class of entities (referred to as attribute synonyms) in order to improve search quality. The state-of-the-art approach that relies on attribute name co-occurrence in web tables suffers from low precision. Our main insight is to combine positive evidence of attribute synonymity from query click logs, with negative evidence from web table attribute name co-occurrences. We formalize the problem as an optimization problem on a graph, with the attribute names being the vertices and the positive and negative evidences from query logs and web table schemas as weighted edges. We develop a linear programming based algorithm to solve the problem that has bi-criteria approximation guarantees. Our experiments on real-life datasets show that our approach has significantly higher precision and recall compared with the state-of-the-art.","PeriodicalId":20455,"journal":{"name":"Proceedings of the 25th International Conference on World Wide Web","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72965123","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}