
Latest publications from Proceedings of The Web Conference 2020

A Generic Solver Combining Unsupervised Learning and Representation Learning for Breaking Text-Based Captchas
Pub Date : 2020-04-20 DOI: 10.1145/3366423.3380166
Sheng Tian, T. Xiong
Although many alternative captcha schemes are available, text-based captchas remain one of the most popular security mechanisms for maintaining Internet security and preventing malicious attacks, owing to user preferences and ease of design. Over the past decade, different methods of breaking captchas have been proposed, which has helped captchas keep evolving and become more robust. However, these previous works generally require heavy expert involvement and gradually become ineffective as new security features are introduced. This paper proposes a generic solver combining unsupervised learning and representation learning to automatically remove the noisy background of captchas and solve text-based captchas. We introduce a new training scheme for constructing mini-batches, which contain a large number of unlabeled hard examples, to improve the efficiency of representation learning. Unlike existing deep learning algorithms, our method requires significantly fewer labeled samples and surpasses the recognition performance of a fully supervised model with the same network architecture. Moreover, extensive experiments show that the proposed method outperforms the state of the art by delivering higher accuracy on various captcha schemes. We provide further discussion of potential applications of the proposed unified framework. We hope that our work can inspire the community to enhance the security of text-based captchas.
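As a rough illustration of the hard-example mini-batch idea described above, here is a minimal sketch. The confidence scores, the 75% hard-example fraction, and the function names are illustrative assumptions, not the paper's recipe:

```python
def build_minibatch(labeled, unlabeled_with_conf, batch_size, hard_frac=0.75):
    """Build a mini-batch dominated by unlabeled hard examples:
    rank unlabeled samples by the current model's confidence and
    keep the least confident ones, then top up with labeled samples.
    The hard_frac split is an illustrative assumption."""
    n_hard = int(batch_size * hard_frac)
    # least confident first = hardest examples under the current model
    ranked = sorted(unlabeled_with_conf, key=lambda pair: pair[1])
    hard = [sample for sample, _ in ranked[:n_hard]]
    return hard + labeled[: batch_size - n_hard]
```

In this toy form, a batch of 4 with the default split draws 3 hard unlabeled samples and 1 labeled one.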
Citations: 12
Multi-Context Attention for Entity Matching
Pub Date : 2020-04-20 DOI: 10.1145/3366423.3380017
Dongxiang Zhang, Yuyang Nie, Sai Wu, Yanyan Shen, K. Tan
Entity matching (EM) is a classic research problem of identifying data instances that refer to the same real-world entity. A recent technical trend in this area is to take advantage of deep learning (DL) to automatically extract discriminative features. DeepER and DeepMatcher have emerged as two pioneering DL models for EM. However, these two state-of-the-art solutions simply incorporate vanilla RNNs and straightforward attention mechanisms. In this paper, we fully exploit the semantic context of the embedding vectors for a pair of entity text descriptions. In particular, we propose an integrated multi-context attention framework that takes into account self-attention, pair-attention, and global-attention from three types of context. The idea is further extended to incorporate attribute attention in order to support structured datasets. We conduct extensive experiments with 7 publicly accessible benchmark datasets. The experimental results clearly establish our superiority over DeepER and DeepMatcher on all the datasets.
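A minimal numerical sketch of combining the three attention contexts over a pair of token-embedding matrices. The dot-product scoring and mean-pooling fusion are simplifying assumptions; the paper's actual framework is a trained network:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_context_repr(a, b, g):
    """Fuse self-, pair-, and global-attention contexts for entity
    description `a` (tokens x dim), against the paired description
    `b` (tokens x dim) and a global context vector `g` (dim,).
    Hypothetical simplification of the multi-context framework."""
    # self-attention: tokens of `a` attend to each other
    self_ctx = softmax(a @ a.T) @ a              # (n, d)
    # pair-attention: tokens of `a` attend to tokens of `b`
    pair_ctx = softmax(a @ b.T) @ b              # (n, d)
    # global-attention: each token weighted against the global vector
    glob_w = softmax(a @ g)                      # (n,)
    glob_ctx = glob_w[:, None] * a               # (n, d)
    # concatenate the three contexts and pool to a fixed-size vector
    fused = np.concatenate([self_ctx, pair_ctx, glob_ctx], axis=-1)
    return fused.mean(axis=0)                    # (3d,)
```

A matcher would compare `multi_context_repr(a, b, g)` with `multi_context_repr(b, a, g)` to score the pair.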
Citations: 24
Multimodal Post Attentive Profiling for Influencer Marketing
Pub Date : 2020-04-20 DOI: 10.1145/3366423.3380052
Seungbae Kim, Jyun-Yu Jiang, Masaki Nakada, Jinyoung Han, Wei Wang
Influencer marketing has become a key marketing method for brands in recent years. Hence, brands have been increasingly utilizing influencers’ social networks to reach niche markets, and researchers have been studying various aspects of influencer marketing. However, brands have often struggled to find and hire the right influencers with specific interests/topics for their marketing, due to a lack of available influencer data and/or the limited capacity of marketing agencies. This paper proposes a multimodal deep learning model that uses text and image information from social media posts (i) to classify influencers into specific interests/topics (e.g., fashion, beauty) and (ii) to classify their posts into certain categories. We use the attention mechanism to select the posts that are more relevant to the topics of influencers, thereby generating useful influencer representations. We conduct experiments on a dataset crawled from Instagram, the most popular social media platform for influencer marketing. The experimental results show that our proposed model significantly outperforms existing user profiling methods, achieving 98% and 96% accuracy in classifying influencers and their posts, respectively. We release our influencer dataset of 33,935 influencers labeled with specific topics based on 10,180,500 posts to facilitate future research.
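The attention-based selection of topic-relevant posts can be sketched as weighted pooling of post embeddings. The dot-product relevance score is an assumption made here for illustration; the paper learns its attention weights as part of the model:

```python
import numpy as np

def profile_influencer(post_embeddings, topic_query):
    """Attention-pool an influencer's post embeddings: posts more
    aligned with the topic query vector receive higher weight in
    the final influencer representation. Hedged sketch of the
    attentive-profiling idea with a plain dot-product score."""
    scores = post_embeddings @ topic_query       # relevance per post
    w = np.exp(scores - scores.max())            # stable softmax
    w /= w.sum()
    return w @ post_embeddings                   # weighted average
```

Posts aligned with the topic query dominate the pooled representation, which a classifier would then map to an interest/topic label.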
Citations: 20
Few-Sample and Adversarial Representation Learning for Continual Stream Mining
Pub Date : 2020-04-20 DOI: 10.1145/3366423.3380153
Zhuoyi Wang, Yigong Wang, Yu Lin, Evan Delord, L. Khan
Deep Neural Networks (DNNs) have primarily been shown to be useful for closed-world classification problems, where the number of categories is fixed. However, DNNs notoriously fail when tasked with label prediction in non-stationary data stream scenarios, where unknown or novel classes (categories not in the training set) continually emerge. For example, new topics continually appear in social media or e-commerce. To meet this challenge, a DNN should not only detect the novel class effectively but also incrementally learn new concepts from limited samples over time. Literature that addresses both problems simultaneously is limited. In this paper, we focus on improving the generalization of the model on novel classes and on making the model continually learn from only a few samples of the novel categories. Unlike existing approaches that rely on abundant labeled instances to re-train/update the model, we propose a new approach based on Few-Sample and Adversarial Representation Learning (FSAR). The key novelty is that we introduce an adversarial confusion term into both the representation learning and the few-sample learning process, which reduces the over-confidence of the model on seen classes and further enhances its generalization to detect and learn new categories from only a few samples. FSAR operates in two stages: first, it learns an intra-class compact and inter-class separated feature embedding to detect the novel classes; next, it collects a few labeled samples belonging to the new categories and utilizes episode training to exploit intrinsic features for few-sample learning. We evaluated FSAR on different datasets, using extensive experimental results from various simulated stream benchmarks to show that FSAR effectively outperforms current state-of-the-art approaches.
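A toy sketch of the detect-then-incrementally-learn loop, using nearest-prototype distances as a stand-in for the learned embedding space. The distance threshold and mean-prototype update are illustrative assumptions, not FSAR's actual training objective:

```python
import numpy as np

def detect_or_classify(x, prototypes, threshold):
    """Nearest-prototype decision: assign embedding `x` to the closest
    class prototype, or flag it as novel (None) when every prototype
    is farther than `threshold`. The threshold is an assumption."""
    dists = np.linalg.norm(prototypes - x, axis=1)
    k = int(dists.argmin())
    if dists[k] > threshold:
        return None          # novel class: no seen prototype is close
    return k

def add_novel_prototype(prototypes, few_samples):
    """Incrementally learn a new category from a few labeled samples
    by appending their mean embedding as a new class prototype."""
    proto = few_samples.mean(axis=0, keepdims=True)
    return np.vstack([prototypes, proto])
```

A sample far from all seen classes is flagged as novel; after a few labeled samples arrive, the new prototype lets future samples of that class be recognized.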
Citations: 13
Deconstructing Google’s Web Light Service
Pub Date : 2020-04-20 DOI: 10.1145/3366423.3380168
Ammar Tahir, Muhammad Tahir Munir, Shaiq Munir Malik, Z. Qazi, I. Qazi
Web Light is a transcoding service introduced by Google to show lighter and faster webpages to users searching on slow mobile clients. The service detects slow clients (e.g., users on 2G) and tries to convert webpages on the fly into a version optimized for these clients. Web Light claims to significantly reduce page load times, save user data, and substantially increase traffic to such webpages. However, there are several concerns around this service, including its effectiveness in preserving relevant content on a page, showing third-party advertisements, and improving user performance, as well as privacy concerns for users and publishers. In this paper, we perform the first independent, empirical analysis of Google’s Web Light service to shed light on these concerns.
Citations: 8
Dynamic Composition for Conversational Domain Exploration
Pub Date : 2020-04-20 DOI: 10.1145/3366423.3380167
Idan Szpektor, Deborah Cohen, G. Elidan, Michael Fink, A. Hassidim, Orgad Keller, Sayali Kulkarni, E. Ofek, S. Pudinsky, Asaf Revach, Shimi Salant
We study conversational domain exploration (CODEX), where the user’s goal is to enrich her knowledge of a given domain by conversing with an informative bot. Such conversations should be well grounded in high-quality domain knowledge as well as engaging and open-ended. A CODEX bot should be proactive and introduce relevant information even if not directly asked for by the user. The bot should also appropriately pivot the conversation to undiscovered regions of the domain. To address these dialogue characteristics, we introduce a novel approach termed dynamic composition that decouples candidate content generation from the flexible composition of bot responses. This allows the bot to control the source, correctness and quality of the offered content, while achieving flexibility via a dialogue manager that selects the most appropriate contents in a compositional manner. We implemented a CODEX bot based on dynamic composition and integrated it into the Google Assistant. As an example domain, the bot conversed about the NBA basketball league in a seamless experience, such that users were not aware whether they were conversing with the vanilla system or the one augmented with our CODEX bot. Results are positive and offer insights into what makes for a good conversation.
Citations: 10
Fast Computation of Explanations for Inconsistency in Large-Scale Knowledge Graphs
Pub Date : 2020-04-20 DOI: 10.1145/3366423.3380014
T. Tran, Mohamed H. Gad-Elrab, D. Stepanova, E. Kharlamov, Jannik Strötgen
Knowledge graphs (KGs) are essential resources for many applications including Web search and question answering. As KGs are often automatically constructed, they may contain incorrect facts. Detecting them is a crucial yet extremely expensive task. Prominent solutions detect and explain inconsistency in KGs with respect to accompanying ontologies that describe the KG domain of interest. Compared to machine learning methods, they are more reliable and human-interpretable, but they scale poorly on large KGs. In this paper, we present a novel approach to dramatically speed up the process of detecting and explaining inconsistency in large KGs by exploiting KG abstractions that capture prominent data patterns. Though much smaller, KG abstractions preserve inconsistency and their explanations. Our experiments with large KGs (e.g., DBpedia and Yago) demonstrate the feasibility of our approach and show that it significantly outperforms the popular baseline.
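The abstraction idea, checking each distinct class pattern once so that one explanation covers every entity sharing that pattern, can be sketched for the special case of class-disjointness axioms. This is a simplification for illustration; the paper handles richer ontology reasoning:

```python
from itertools import combinations

def inconsistency_explanations(entity_classes, disjoint_pairs):
    """Detect disjointness violations at the abstraction level:
    each distinct class pattern is checked once, and one explanation
    covers all entities sharing that pattern. `entity_classes` maps
    each entity to its set of asserted classes; `disjoint_pairs` is
    a set of class pairs declared disjoint in the ontology."""
    patterns = {}                            # pattern -> entities sharing it
    for entity, classes in entity_classes.items():
        patterns.setdefault(frozenset(classes), []).append(entity)
    explanations = []
    for pattern, entities in patterns.items():
        for a, b in combinations(sorted(pattern), 2):
            if (a, b) in disjoint_pairs or (b, a) in disjoint_pairs:
                explanations.append({"clash": (a, b), "entities": entities})
    return explanations
```

The saving is that the inner check runs once per pattern, not once per entity, which is why a small abstraction can stand in for a large KG.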
Citations: 9
Measurements, Analyses, and Insights on the Entire Ethereum Blockchain Network
Pub Date : 2020-04-20 DOI: 10.1145/3366423.3380103
Xi Tong Lee, Arijit Khan, Sourav Sengupta, Yu-Han Ong, Xu Liu
Blockchains are increasingly becoming popular due to the prevalence of cryptocurrencies and decentralized applications. Ethereum is a distributed public blockchain network that focuses on running code (smart contracts) for decentralized applications. More simply, it is a platform for sharing information in a global state that cannot be manipulated or changed. The Ethereum blockchain introduces a novel ecosystem of human users and autonomous agents (smart contracts). In this network, we are interested in all possible interactions: user-to-user, user-to-contract, contract-to-user, and contract-to-contract. This requires us to construct interaction networks from the entire Ethereum blockchain data, where vertices are accounts (users, contracts) and arcs denote interactions. Our analyses of the networks reveal new insights by combining information from the four networks.
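Splitting raw transfers into the four interaction networks is straightforward once the set of contract addresses is known; a minimal sketch with hypothetical field names, not the paper's pipeline:

```python
from collections import defaultdict

def build_interaction_networks(transactions, contracts):
    """Partition (sender, receiver) transaction pairs into the four
    directed interaction networks: user-to-user, user-to-contract,
    contract-to-user, and contract-to-contract. `contracts` is the
    set of addresses known to be smart contracts; every other
    address is treated as a user account."""
    nets = defaultdict(list)
    for sender, receiver in transactions:
        s = "contract" if sender in contracts else "user"
        r = "contract" if receiver in contracts else "user"
        nets[f"{s}-to-{r}"].append((sender, receiver))  # one arc per interaction
    return dict(nets)
```

Each of the four arc lists can then be loaded into a graph library to compute the local and global properties studied below.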
Citations: 54
PG2S+: Stack Distance Construction Using Popularity, Gap and Machine Learning
Pub Date : 2020-04-20 DOI: 10.1145/3366423.3380176
Jiangwei Zhang, Y. Tay
Stack distance characterizes the temporal locality of workloads and has played a vital role in cache analysis since the 1970s. However, exact stack distance calculation is too costly and impractical for online use. Hence, much work has been done to optimize the exact computation or to approximate it through sampling or modeling. This paper introduces a new approximation technique, PG2S, based on reference popularity and gap distance. The approximation is exact under the Independent Reference Model (IRM). The technique is further extended, using machine learning, to PG2S+ for non-IRM reference patterns. Extensive experiments show that PG2S+ is much more accurate and robust than other state-of-the-art algorithms for determining stack distance. PG2S+ is the first technique to exploit the strong correlation among reference popularity, gap distance and stack distance.
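For background on what the paper approximates: the exact LRU stack distance of a reference is its position in the LRU stack, i.e. the number of distinct items referenced since the previous reference to the same item. The abstract does not give the PG2S algorithm itself; this sketch only shows the direct (costly) computation that motivates approximation.

```python
# Exact LRU stack distances for a reference trace, O(N*M) in trace
# length N and stack depth M. Distance 0 means the item was
# re-referenced immediately; None marks a first ("cold") reference.
def stack_distances(trace):
    stack = []  # LRU stack: most recently used item at the front
    dists = []
    for x in trace:
        if x in stack:
            d = stack.index(x)  # distinct items seen since last use of x
            stack.remove(x)
        else:
            d = None            # cold miss: no previous reference
        stack.insert(0, x)      # x becomes most recently used
        dists.append(d)
    return dists

print(stack_distances(["a", "b", "a", "c", "b", "b"]))
# [None, None, 1, None, 2, 0]
```

The `x in stack` / `stack.index(x)` scans are what make this impractical online, which is why sampling, modeling, and approximations such as PG2S exist.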
Citations: 3
Interpretable Complex Question Answering
Pub Date : 2020-04-20 DOI: 10.1145/3366423.3380764
Soumen Chakrabarti
We will review cross-community co-evolution of question answering (QA) with the advent of large-scale knowledge graphs (KGs), continuous representations of text and graphs, and deep sequence analysis. Early QA systems were information retrieval (IR) systems enhanced to extract named entity spans from high-scoring passages. Starting with WordNet, a series of structured curations of language and world knowledge, called KGs, enabled further improvements. The corpus is unstructured and messy to exploit for QA. If a question can be answered using the KG alone, it is attractive to ‘interpret’ the free-form question into a structured query, which is then executed on the structured KG. This process is called KGQA. Answers can be high-quality and explainable if the KG has an answer, but manual curation results in low coverage. KGs were soon found useful to harness corpus information. Named entity mention spans could be tagged with fine-grained types (e.g., scientist), or even specific entities (e.g., Einstein). The QA system can learn to decompose a query into functional parts, e.g., “which scientist” and “played the violin”. With the increasing success of such systems, ambition grew to address multi-hop or multi-clause queries, e.g., “the father of the director of La La Land teaches at which university?” or “who directed an award-winning movie and is the son of a Princeton University professor?” Questions limited to simple path traversals in KGs have been encoded to a vector representation, which a decoder then uses to guide the KG traversal. Recently, the corpus counterpart of such strategies has also been proposed. However, for general multi-clause queries that do not necessarily translate to paths, and seek to bind multiple variables to satisfy multiple clauses, or involve logic, comparison, aggregation and other arithmetic, neural programmer-interpreter systems have seen some success.
Our key focus will be on identifying situations where manual introduction of structural bias is essential for accuracy, as against cases where sufficient data can get around distant or no supervision.
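The multi-hop example in the abstract can be made concrete as a chain of relation lookups over a KG. This is a toy illustration of "interpreting" a question into a structured query, with a hypothetical miniature KG of (entity, relation) → entity facts; real KGQA systems learn this decomposition rather than hard-coding it.

```python
# Hypothetical miniature KG: (head entity, relation) -> tail entity.
kg = {
    ("La_La_Land", "director"): "Damien_Chazelle",
    ("Damien_Chazelle", "father"): "Bernard_Chazelle",
    ("Bernard_Chazelle", "teaches_at"): "Princeton_University",
}

def follow_path(kg, start, relations):
    """Execute a structured query as a chain of relation lookups."""
    entity = start
    for rel in relations:
        entity = kg.get((entity, rel))
        if entity is None:
            return None  # path breaks: the KG has no fact here (low coverage)
    return entity

# "The father of the director of La La Land teaches at which university?"
print(follow_path(kg, "La_La_Land", ["director", "father", "teaches_at"]))
# Princeton_University
```

The answer is explainable precisely because the traversed path of facts serves as a proof; when any hop is missing from the KG, the query fails, which is the low-coverage problem the abstract notes.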
Citations: 4