Proceedings of The Web Conference 2020: Latest Papers

Graph-Query Suggestions for Knowledge Graph Exploration

Proceedings of The Web Conference 2020

Pub Date : 2020-04-19 DOI: 10.1145/3366423.3380005

Matteo Lissandrini, D. Mottin, Themis Palpanas, Yannis Velegrakis

We consider the task of exploratory search through graph queries on knowledge graphs. We propose to assist the user by expanding the query with intuitive suggestions, yielding a more informative (full) query that can retrieve more detailed and relevant answers. To achieve this, we propose a model that bridges graph search paradigms with well-established information retrieval techniques. Our approach does not require any additional knowledge from the user and builds on principled language modelling approaches. We empirically show the effectiveness and efficiency of our approach on a large knowledge graph, and show how our suggestions help build more complete and informative queries.

Citations: 25

A Cue Adaptive Decoder for Controllable Neural Response Generation

Proceedings of The Web Conference 2020

Pub Date : 2020-04-19 DOI: 10.1145/3366423.3380008

Weichao Wang, Shi Feng, Wei Gao, Daling Wang, Yifei Zhang

In open-domain dialogue systems, dialogue cues such as emotion, persona, and emoji can be incorporated into conversation models to strengthen the semantic relevance of generated responses. Existing neural response generation models either incorporate the dialogue cue into the decoder’s initial state or embed the cue indiscriminately into the state of every generated word, which may cause the gradients of the embedded cue to vanish or disturb the semantic relevance of generated words during back propagation. In this paper, we propose a Cue Adaptive Decoder (CueAD) that aims to dynamically determine the involvement of a cue at each generation step of decoding. For this purpose, we extend the Gated Recurrent Unit (GRU) network with an adaptive cue representation to facilitate cue incorporation, in which an adaptive gating unit decides when to incorporate cue information so that the cue can provide useful clues for enhancing the semantic relevance of the generated words. Experimental results show that CueAD outperforms state-of-the-art baselines by large margins.

Citations: 2

Provably and Efficiently Approximating Near-cliques using the Turán Shadow: PEANUTS

Proceedings of The Web Conference 2020

Pub Date : 2020-04-19 DOI: 10.1145/3366423.3380264

Shweta Jain, C. Seshadhri

Clique and near-clique counts are important graph properties with applications in graph generation, graph modeling, graph analytics, and community detection, among others. They are the archetypal examples of dense subgraphs. While there are several different definitions of near-cliques, most of them share the attribute that they are cliques that are missing a small number of edges. Clique counting is itself considered a challenging problem. Counting near-cliques is significantly harder, since the search space for near-cliques is orders of magnitude larger than that of cliques. We give a formulation of a near-clique as a clique that is missing a constant number of edges. We exploit the fact that a near-clique contains a smaller clique, and use techniques for clique sampling to count near-cliques. This method allows us to count near-cliques with 1 or 2 missing edges in graphs with tens of millions of edges. To the best of our knowledge, there was no known efficient method for this problem, and we obtain a 10x to 100x speedup over existing algorithms for counting near-cliques. Our main technique is a space-efficient adaptation of the Turán Shadow sampling approach, recently introduced by Jain and Seshadhri (WWW 2017). This approach constructs a large recursion tree (called the Turán Shadow) that represents cliques in a graph. We design a novel algorithm that builds an estimator for near-cliques, using an online, compact construction of the Turán Shadow.
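
To make the formulation above concrete, here is a minimal brute-force sketch that counts k-vertex subsets missing an exact number of edges. It only illustrates one reading of the definition ("a clique that is missing a constant number of edges"); it is not the Turán Shadow sampling method from the paper, and it will not scale beyond tiny graphs. The function name `count_near_cliques` and the example graph are assumptions made for illustration.

```python
from itertools import combinations


def count_near_cliques(edges, k, missing):
    """Count k-vertex subsets whose induced subgraph lacks exactly `missing` edges.

    Hypothetical helper: exhaustive enumeration, for illustration only.
    """
    edge_set = {frozenset(e) for e in edges}
    vertices = sorted({v for e in edges for v in e})
    count = 0
    for subset in combinations(vertices, k):
        # Number of vertex pairs in the subset that are not joined by an edge.
        absent = sum(
            1 for pair in combinations(subset, 2)
            if frozenset(pair) not in edge_set
        )
        if absent == missing:
            count += 1
    return count


# Example graph (an assumption): the complete graph on 4 vertices
# with the single edge (2, 3) removed.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3)]
triangles = count_near_cliques(edges, k=3, missing=0)
near_4_cliques = count_near_cliques(edges, k=4, missing=1)
```

With `missing=0` the same function counts ordinary cliques, which shows why the near-clique search space is larger: every additional allowed missing edge admits many more vertex subsets.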

Citations: 11

CellRep: Usage Representativeness Modeling and Correction Based on Multiple City-Scale Cellular Networks

Proceedings of The Web Conference 2020

Pub Date : 2020-04-19 DOI: 10.1145/3366423.3380141

Zhihan Fang, Guang Wang, Shuai Wang, Chaoji Zuo, Fan Zhang, Desheng Zhang

Understanding representativeness in cellular web logs at city scale is essential for web applications. Most existing work on cellular web analyses or applications is built upon data from a single network in a city, which may not be representative of the overall usage patterns, since multiple cellular networks coexist in most cities in the world. In this paper, we conduct the first comprehensive investigation of multiple cellular networks in a city with a 100% user penetration rate. We study the correlations and differences in web usage patterns (e.g., internet access services) between diverse cellular networks along spatial and temporal dimensions, in order to quantify how well web usage from a single network represents the usage patterns of all users in the same city. Moreover, relying on three external datasets, we study the correlation between representativeness and contextual factors (e.g., Point-of-Interest, population, and mobility) to explain the potential causes of the representativeness difference. We found that contextual diversity is a key reason for the representativeness difference, and that representativeness has a significant impact on the performance of real-world applications. Based on the analysis results, we further design a correction model to address the bias of single cellphone networks and improve representativeness by 45.8%.

Citations: 8

Hierarchically Structured Transformer Networks for Fine-Grained Spatial Event Forecasting

Proceedings of The Web Conference 2020

Pub Date : 2020-04-19 DOI: 10.1145/3366423.3380296

Xian Wu, Chao Huang, Chuxu Zhang, N. Chawla

Spatial event forecasting is challenging and crucial for urban sensing scenarios, and it benefits a wide spectrum of spatial-temporal mining applications, ranging from traffic management and public safety to environmental policy making. Although significant progress has been made on the spatial-temporal prediction problem, most existing deep learning based methods assume a coarse-grained spatial setting, and their success largely relies on data sufficiency. In many real-world applications, predicting events at a fine-grained spatial resolution plays a critical role in providing high discernibility of spatial-temporal data distributions. However, in such cases, applying existing methods results in weak performance, since they may not capture quality spatial-temporal representations well when training triple instances are highly imbalanced across locations and time. To tackle this challenge, we develop a hierarchically structured Spatial-Temporal Transformer network (STtrans), which leverages a main embedding space to capture the inter-dependencies across time and space and thereby alleviate the data imbalance issue. In our STtrans framework, the first-stage transformer module discriminates between different types of region-wise and time-wise relations. To make the latent spatial-temporal representations reflect the relational structure between categories, we further develop a cross-category fusion transformer network that endows STtrans with the capability to preserve the semantic signals in a fully dynamic manner. Finally, an adversarial training strategy is introduced to yield robust spatial-temporal learning under data imbalance. Extensive experiments on real-world imbalanced spatial-temporal datasets from NYC and Chicago demonstrate the superiority of our method over various state-of-the-art baselines.

Citations: 37

eDarkFind: Unsupervised Multi-view Learning for Sybil Account Detection

Proceedings of The Web Conference 2020

Pub Date : 2020-04-19 DOI: 10.1145/3366423.3380263

Ramnath Kumar, S. Yadav, Raminta Daniulaityte, Francois R. Lamy, K. Thirunarayan, Usha Lokala, A. Sheth

Darknet crypto markets are online marketplaces that use cryptocurrencies (e.g., Bitcoin, Monero) and advanced encryption techniques to offer anonymity to vendors and consumers trading illegal goods or services. The exact volume of substances advertised and sold through these crypto markets is difficult to assess, at least in part because vendors tend to maintain multiple accounts (or Sybil accounts) within and across different crypto markets. Linking these different accounts will allow us to accurately evaluate the volume of substances advertised across the different crypto markets by each vendor. In this paper, we present a multi-view unsupervised framework (eDarkFind) that helps model vendor characteristics and facilitates Sybil account detection. We employ a multi-view learning paradigm to generalize and improve performance by exploiting diverse views from multiple rich sources such as BERT, stylometric, and location representations. Our model is further tailored to take advantage of domain-specific knowledge, such as the Drug Abuse Ontology, to take substance information into consideration. We performed extensive experiments and demonstrated that the multiple views obtained from diverse sources can be effective in linking Sybil accounts. Our proposed eDarkFind model achieves an accuracy of 98% on three real-world datasets, which shows the generality of the approach.

Citations: 19

Attention Please: Your Attention Check Questions in Survey Studies Can Be Automatically Answered

Proceedings of The Web Conference 2020

Pub Date : 2020-04-19 DOI: 10.1145/3366423.3380195

Weiping Pei, Arthur Mayer, Kaylynn Tu, Chuan Yue

Attention check questions have become commonly used in online surveys published on popular crowdsourcing platforms as a key mechanism to filter out inattentive respondents and improve data quality. However, little research considers the vulnerabilities of this important quality control mechanism, which can allow attackers, including irresponsible and malicious respondents, to automatically answer attention check questions and efficiently achieve their goals. In this paper, we perform the first study to investigate such vulnerabilities, and demonstrate that attackers can leverage deep learning techniques to pass attention check questions automatically. We propose AC-EasyPass, an attack framework with a concrete model that combines a convolutional neural network and weighted feature reconstruction to easily pass attention check questions. We construct the first attention check question dataset, which consists of both original and augmented questions, and demonstrate the effectiveness of AC-EasyPass. We explore two simple defense methods, adding adversarial sentences and adding typos, for survey designers to mitigate the risks posed by AC-EasyPass; however, these methods are fragile due to their limitations from both technical and usability perspectives, underlining the challenging nature of defense. We hope our work will draw sufficient attention from the research community towards developing more robust attention check mechanisms. More broadly, our work intends to prompt the research community to seriously consider the emerging risks posed by the malicious use of machine learning techniques to the quality, validity, and trustworthiness of crowdsourcing and social computing.

Citations: 14

Smaller, Faster & Lighter KNN Graph Constructions

Proceedings of The Web Conference 2020

Pub Date : 2020-04-19 DOI: 10.1145/3366423.3380184

R. Guerraoui, Anne-Marie Kermarrec, Olivier Ruas, François Taïani

We propose GoldFinger, a new compact and fast-to-compute binary representation of datasets to approximate Jaccard’s index. We illustrate the effectiveness of GoldFinger on the emblematic big data problem of K-Nearest-Neighbor (KNN) graph construction and show that GoldFinger can drastically accelerate a large range of existing KNN algorithms with little to no overhead. As a side effect, we also show that the compact representation of the data protects users’ privacy for free by providing k-anonymity and l-diversity. Our extensive evaluation of the resulting approach on several realistic datasets shows that our approach delivers speedups of up to 78.9% compared to the use of raw data while only incurring a negligible to moderate loss in terms of KNN quality. To convey the practical value of such a scheme, we apply it to item recommendation and show that the loss in recommendation quality is negligible.
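
The abstract above does not spell out how the binary representation is built, so the following is only a sketch of one simple way a compact bit fingerprint can approximate Jaccard's index: each item sets a single bit chosen by a hash, and the index is estimated from bitwise AND and OR. The function names, the 64-bit width, and the use of SHA-256 are assumptions for illustration, not details taken from the paper.

```python
import hashlib


def fingerprint(items, bits=64):
    """Map a collection of items to a `bits`-bit integer; each item sets one bit.

    Hypothetical construction: the bit position comes from a SHA-256 digest.
    """
    fp = 0
    for item in items:
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        fp |= 1 << (int.from_bytes(digest[:8], "big") % bits)
    return fp


def jaccard_exact(a, b):
    """Exact Jaccard index |A ∩ B| / |A ∪ B| on the raw sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def jaccard_estimate(fp_a, fp_b):
    """Estimate the Jaccard index from two fingerprints using bit counts."""
    union_bits = bin(fp_a | fp_b).count("1")
    if union_bits == 0:
        return 0.0
    return bin(fp_a & fp_b).count("1") / union_bits


# Example profiles (assumed data): two users' sets of liked item ids.
alice = {"item1", "item2", "item3", "item4"}
bob = {"item3", "item4", "item5"}
exact = jaccard_exact(alice, bob)
approx = jaccard_estimate(fingerprint(alice), fingerprint(bob))
```

The estimate costs two bitwise operations and two bit counts regardless of profile size, which is where a speedup over raw sets would come from. Hash collisions make the estimate approximate, and because many items share a bit, the fingerprint does not reveal the exact profile.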

Citations: 8

Clustering with a faulty oracle

Proceedings of The Web Conference 2020

Pub Date : 2020-04-19 DOI: 10.1145/3366423.3380045

Kasper Green Larsen, M. Mitzenmacher, Charalampos E. Tsourakakis

Clustering, i.e., finding groups in the data, is a problem that permeates multiple fields of science and engineering. Recently, the problem of clustering with a noisy oracle has drawn attention due to various applications, including crowdsourced entity resolution [33] and predicting signs of interactions in large-scale online social networks [20, 21]. Here, we consider the following fundamental model for two clusters, as proposed by Mitzenmacher and Tsourakakis [28] and Mazumdar and Saha [25]: there exist n items, belonging to two unknown groups. We are allowed to query whether any pair of nodes belongs to the same cluster or not, but the answer to the query is corrupted with some probability q. Let 1 > δ = 1 − 2q > 0 be the bias. In this work, we provide a polynomial time algorithm that recovers all signs correctly with high probability in the presence of noise with queries. This is the best known result for this problem for all but tiny δ, improving on the current state-of-the-art due to Mazumdar and Saha [25].
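
The query model described above can be simulated in a few lines. The sketch below is an illustration of the noisy oracle only, not of the paper's recovery algorithm. The class and function names are hypothetical, and caching each answer so that repeated queries return the same result is an assumption of this sketch, not something stated in the abstract.

```python
import random


def bias(q):
    """Bias delta = 1 - 2q of an oracle whose answer is wrong with probability q."""
    return 1 - 2 * q


class FaultyOracle:
    """Answers 'are u and v in the same cluster?' with error probability q."""

    def __init__(self, labels, q, seed=0):
        self.labels = labels          # hidden ground truth: item -> group (0 or 1)
        self.q = q
        self.rng = random.Random(seed)
        self.cache = {}               # assumption: answers are persistent per pair

    def same_cluster(self, u, v):
        key = (min(u, v), max(u, v))
        if key not in self.cache:
            truth = self.labels[u] == self.labels[v]
            flipped = self.rng.random() < self.q
            # A flipped answer is the negation of the truth.
            self.cache[key] = truth != flipped
        return self.cache[key]


# Example (assumed data): 6 items in two hidden groups, error probability 0.25.
labels = [0, 0, 0, 1, 1, 1]
oracle = FaultyOracle(labels, q=0.25)
answer = oracle.same_cluster(0, 3)
delta = bias(0.25)
```

A bias δ close to 1 means the oracle is almost always right, while δ close to 0 means each answer carries almost no information, which is why query bounds for this problem depend on δ.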

Citations: 9

Crowdsourcing Detection of Sampling Biases in Image Datasets

Proceedings of The Web Conference 2020

Pub Date : 2020-04-19 DOI: 10.1145/3366423.3380063

Xiao Hu, Haobo Wang, Anirudh Vegesana, Somesh Dube, Kaiwen Yu, Gore Kao, Shuo-Han Chen, Yung-Hsiang Lu, G. Thiruvathukal, Ming Yin

Despite many exciting innovations in computer vision, recent studies reveal a number of risks in existing computer vision systems, suggesting that the results of such systems may be unfair and untrustworthy. Many of these risks can be partly attributed to the use of a training image dataset that exhibits sampling biases and thus does not accurately reflect the real visual world. Being able to detect potential sampling biases in a visual dataset prior to model development is thus essential for mitigating fairness and trustworthiness concerns in computer vision. In this paper, we propose a three-step crowdsourcing workflow that brings humans into the loop to facilitate bias discovery in image datasets. Through two sets of evaluation studies, we find that the proposed workflow can effectively organize the crowd to detect sampling biases both in datasets that are artificially created with designed biases and in real-world image datasets that are widely used in computer vision research and system development.

Citations: 16
