Proceedings of the Web Conference 2021: Latest Papers (Page 7)

GNEM: A Generic One-to-Set Neural Entity Matching Framework

Proceedings of the Web Conference 2021

Pub Date : 2021-04-19 DOI: 10.1145/3442381.3450119

Runjin Chen, Yanyan Shen, Dongxiang Zhang

Entity Matching is a classic research problem in any data analytics pipeline, aiming to identify records referring to the same real-world entity. It plays an important role in data cleansing and integration. Advanced entity matching techniques focus on extracting syntactic or semantic features from record pairs via complex neural architectures or pre-trained language models. However, the performances always suffer from noisy or missing attribute values in the records. We observe that comparing one record with several relevant records in a collective manner allows each pairwise matching decision to be made by borrowing valuable insights from other pairs, which is beneficial to the overall matching performance. In this paper, we propose a generic one-to-set neural framework named GNEM for entity matching. GNEM predicts matching labels between one record and a set of relevant records simultaneously. It constructs a record pair graph with weighted edges and adopts the graph neural network to propagate information among pairs. We further show that GNEM can be interpreted as an extension and generalization of the existing pairwise matching techniques. Extensive experiments on real-world data sets demonstrate that GNEM consistently outperforms the existing pairwise entity matching techniques and achieves up to 8.4% improvement on F1-Score compared with the state-of-the-art neural methods.
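
The idea of propagating information among record pairs over a weighted graph can be illustrated with a toy example. The sketch below is not GNEM's architecture: it assumes each (record, candidate) pair already has a feature vector, and it performs one unlearned round of weighted neighbor averaging. The function name `propagate` and the `self_weight` parameter are hypothetical.

```python
def propagate(features, edge_weights, self_weight=0.5):
    """One round of weighted neighbor averaging over a record-pair graph.

    features: list of feature vectors, one per (record, candidate) pair.
    edge_weights: dict mapping (i, j) to a non-negative weight between pairs.
    self_weight: share of a pair's own features kept after mixing (assumed).
    """
    updated = []
    for i, own in enumerate(features):
        dim = len(own)
        aggregated = [0.0] * dim
        total = 0.0
        for j, other in enumerate(features):
            weight = edge_weights.get((i, j), 0.0)
            if i == j or weight == 0.0:
                continue
            total += weight
            for k in range(dim):
                aggregated[k] += weight * other[k]
        if total == 0.0:
            # An isolated pair keeps its own features unchanged.
            updated.append(list(own))
            continue
        updated.append([
            self_weight * own[k] + (1.0 - self_weight) * aggregated[k] / total
            for k in range(dim)
        ])
    return updated


if __name__ == "__main__":
    pair_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    weights = {(0, 1): 1.0, (1, 0): 1.0}
    print(propagate(pair_features, weights))
```

In a learned model, the mixing step would be replaced by trainable graph neural network layers and followed by a classifier that labels all pairs at once.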



Citations: 6

ATJ-Net: Auto-Table-Join Network for Automatic Learning on Relational Databases

Proceedings of the Web Conference 2021

Pub Date : 2021-04-19 DOI: 10.1145/3442381.3449980

Jinze Bai, Jialin Wang, Zhao Li, Donghui Ding, Ji Zhang, Jun Gao

A relational database, consisting of multiple tables, provides heterogeneous information across various entities and is widely used in real-world services. This paper studies the supervised learning task on multiple tables, aiming to predict one label column with the help of multiple-tabular data. However, classical ML techniques mainly focus on single-tabular data. Multiple-tabular data involves many-to-many mappings among joinable attributes and n-ary relations, which cannot be utilized directly by classical ML techniques. Besides, current graph techniques, such as heterogeneous information networks (HIN) and graph neural networks (GNN), cannot be deployed directly and automatically in a multi-table environment, which limits learning on databases. For automatic learning on relational databases, we propose an auto-table-join network (ATJ-Net). Multiple tables with relationships are considered as a hypergraph, where vertices are joinable attributes and hyperedges are tuples of tables. Then, ATJ-Net builds a graph neural network on the heterogeneous hypergraph, which samples and aggregates the vertices and hyperedges on n-hop sub-graphs as the receptive field. In order to enable ATJ-Net to be automatically deployed to different datasets and avoid the "no free lunch" dilemma, we use random architecture search to select optimal aggregators and prune redundant paths in the network. To verify the effectiveness of our method across various tasks and schemas, we conduct extensive experiments on 4 tasks, 8 schemas, and 19 sub-datasets covering citing prediction, review classification, recommendation, and a task-blind challenge. ATJ-Net achieves the best performance over state-of-the-art approaches on three tasks and is competitive with the KddCup Winner solution on the task-blind challenge.
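
The hypergraph view described in the abstract (joinable attribute values as vertices, table tuples as hyperedges) can be sketched as follows. This is only an illustration under assumptions: a vertex is modeled as a `(column, value)` pair, joinable columns are assumed to share a name across tables, and the n-hop step collects every reachable hyperedge instead of sampling and aggregating as ATJ-Net does. The names `build_hypergraph` and `n_hop_edges` are hypothetical.

```python
from collections import defaultdict


def build_hypergraph(tables, joinable):
    """Turn tables into a hypergraph.

    tables: {table_name: [row_dict, ...]}
    joinable: {table_name: [joinable_column, ...]}
    Returns (hyperedges, incidence), where a hyperedge id is
    (table_name, row_index) and a vertex is (column, value).
    """
    hyperedges = {}
    incidence = defaultdict(set)
    for name, rows in tables.items():
        for index, row in enumerate(rows):
            edge_id = (name, index)
            vertices = {
                (column, row[column])
                for column in joinable.get(name, [])
                if column in row
            }
            hyperedges[edge_id] = vertices
            for vertex in vertices:
                incidence[vertex].add(edge_id)
    return hyperedges, dict(incidence)


def n_hop_edges(start_vertex, hyperedges, incidence, hops):
    """Collect the hyperedges reachable from start_vertex within `hops` hops."""
    seen_edges = set()
    seen_vertices = {start_vertex}
    frontier = {start_vertex}
    for _ in range(hops):
        next_frontier = set()
        for vertex in frontier:
            for edge_id in incidence.get(vertex, ()):
                if edge_id in seen_edges:
                    continue
                seen_edges.add(edge_id)
                for other in hyperedges[edge_id]:
                    if other not in seen_vertices:
                        seen_vertices.add(other)
                        next_frontier.add(other)
        frontier = next_frontier
    return seen_edges


if __name__ == "__main__":
    tables = {
        "users": [{"user_id": 1}],
        "reviews": [{"user_id": 1, "item_id": 10}, {"user_id": 2, "item_id": 10}],
    }
    joinable = {"users": ["user_id"], "reviews": ["user_id", "item_id"]}
    edges, inc = build_hypergraph(tables, joinable)
    print(sorted(n_hop_edges(("user_id", 1), edges, inc, hops=2)))
```

The set returned for a start vertex plays the role of a receptive field: the tuples whose features a model could aggregate when predicting a label for that vertex.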



Citations: 4

Pivot-based Candidate Retrieval for Cross-lingual Entity Linking

Proceedings of the Web Conference 2021

Pub Date : 2021-04-19 DOI: 10.1145/3442381.3449852

Qian Liu, Xiubo Geng, Jie Lu, Daxin Jiang

Entity candidate retrieval plays a critical role in cross-lingual entity linking (XEL). In XEL, entity candidate retrieval needs to retrieve a list of plausible candidate entities from a large knowledge graph in a target language given a piece of text in a sentence or question, namely a mention, in a source language. Existing works mainly fall into two categories: lexicon-based and semantic-based approaches. The lexicon-based approach usually creates cross-lingual and mention-entity lexicons, which is effective but relies heavily on bilingual resources (e.g. inter-language links in Wikipedia). The semantic-based approach maps mentions and entities in different languages to a unified embedding space, which reduces dependence on large-scale bilingual dictionaries. However, its effectiveness is limited by the representation capacity of fixed-length vectors. In this paper, we propose a pivot-based approach which inherits the advantages of the aforementioned two approaches while avoiding their limitations. It takes an intermediary set of plausible target-language mentions as pivots to bridge the two types of gaps: cross-lingual gap and mention-entity gap. Specifically, it first converts mentions in the source language into an intermediary set of plausible mentions in the target language by cross-lingual semantic retrieval and a selective mechanism, and then retrieves candidate entities based on the generated mentions by lexical retrieval. The proposed approach only relies on a small bilingual word dictionary, and fully exploits the benefits of both lexical and semantic matching. Experimental results on two challenging cross-lingual entity linking datasets spanning over 11 languages show that the pivot-based approach outperforms both the lexicon-based and semantic-based approach by a large margin.
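
The two-stage flow in the abstract (semantic retrieval of target-language pivot mentions, then lexical retrieval of entities) can be sketched with toy data. Everything concrete here is an assumption for illustration: the embeddings are hand-written vectors, the selective mechanism is reduced to a similarity threshold `min_sim`, and lexical retrieval is reduced to an exact dictionary lookup. The function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_candidates(source_vec, target_mention_vecs, mention_to_entities,
                        top_k=2, min_sim=0.5):
    """Return (pivots, candidates) for one source-language mention.

    Stage 1 ranks target-language mentions by similarity to the source
    mention and keeps the top_k that pass the min_sim threshold.
    Stage 2 looks up entities for each pivot in a mention-entity lexicon.
    """
    scored = sorted(
        ((cosine(source_vec, vec), mention)
         for mention, vec in target_mention_vecs.items()),
        reverse=True,
    )
    pivots = [mention for sim, mention in scored[:top_k] if sim >= min_sim]

    candidates = []
    for mention in pivots:
        for entity in mention_to_entities.get(mention, []):
            if entity not in candidates:
                candidates.append(entity)
    return pivots, candidates


if __name__ == "__main__":
    target_vecs = {"alpha": [1.0, 0.0], "beta": [0.0, 1.0], "gamma": [0.8, 0.6]}
    lexicon = {"alpha": ["E1"], "beta": ["E3"], "gamma": ["E1", "E2"]}
    print(retrieve_candidates([1.0, 0.0], target_vecs, lexicon))
```

The design point the sketch tries to convey is that the embedding space is only used to cross the language gap, while the mention-to-entity step stays lexical.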



Citations: 2

LChecker: Detecting Loose Comparison Bugs in PHP

Proceedings of the Web Conference 2021

Pub Date : 2021-04-19 DOI: 10.1145/3442381.3449826

Penghui Li, W. Meng

Weakly-typed languages such as PHP support loosely comparing two operands by implicitly converting their types and values. Such a language feature is widely used but can also pose severe security threats. In certain conditions, loose comparisons can cause unexpected results, leading to authentication bypass and other functionality problems. In this paper, we present the first in-depth study of such loose comparison bugs. We develop LChecker, a system to statically detect PHP loose comparison bugs. It employs a context-sensitive inter-procedural data-flow analysis together with several new techniques. We also enhance the PHP interpreter to help dynamically validate the detected bugs. Our evaluation shows that LChecker can both effectively and efficiently detect PHP loose comparison bugs with a reasonably low false-positive rate. It also successfully detected all previously known bugs in our evaluation dataset with no false negative. Using LChecker, we discovered 42 new loose comparison bugs and were assigned 9 new CVE IDs.
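
To make the risk concrete, the sketch below emulates in Python one well-known PHP behavior: when both operands of `==` are numeric strings, PHP compares them as numbers, so two different strings that both look like `0e...` compare equal. This is a simplified emulation written for illustration, not PHP itself and not part of LChecker; the regular expression ignores whitespace handling and other PHP version differences, and the function name is hypothetical.

```python
import re

# Simplified notion of a PHP "numeric string": sign, digits, optional
# fraction, optional exponent. Whitespace rules are deliberately ignored.
_NUMERIC = re.compile(r"[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?")


def php_like_loose_equals(a: str, b: str) -> bool:
    """Emulate PHP's `==` for two strings (simplified).

    If both strings are numeric, they are compared as floats;
    otherwise they are compared as plain strings.
    """
    if _NUMERIC.fullmatch(a) and _NUMERIC.fullmatch(b):
        return float(a) == float(b)
    return a == b


if __name__ == "__main__":
    stored_hash = "0e1234"      # hypothetical value that parses as 0 * 10^1234
    attacker_hash = "0e5678"    # a different string that also parses as zero
    print(php_like_loose_equals(stored_hash, attacker_hash))  # loose: equal
    print(stored_hash == attacker_hash)                       # strict: not equal
```

In PHP the usual remedy is a strict comparison (`===`) or a dedicated function such as `hash_equals` when comparing secrets.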


Citations: 7

What do You Mean? Interpreting Image Classification with Crowdsourced Concept Extraction and Analysis

Proceedings of the Web Conference 2021

Pub Date : 2021-04-19 DOI: 10.1145/3442381.3450069

Agathe Balayn, Panagiotis Soilis, C. Lofi, Jie Yang, A. Bozzon

Global interpretability is a vital requirement for image classification applications. Existing interpretability methods mainly explain a model behavior by identifying salient image patches, which require manual efforts from users to make sense of, and also do not typically support model validation with questions that investigate multiple visual concepts. In this paper, we introduce a scalable human-in-the-loop approach for global interpretability. Salient image areas identified by local interpretability methods are annotated with semantic concepts, which are then aggregated into a tabular representation of images to facilitate automatic statistical analysis of model behavior. We show that this approach answers interpretability needs for both model validation and exploration, and provides semantically more diverse, informative, and relevant explanations while still allowing for scalable and cost-efficient execution.


Citations: 21

Attent: Active Attributed Network Alignment

Proceedings of the Web Conference 2021

Pub Date : 2021-04-19 DOI: 10.1145/3442381.3449886

Qinghai Zhou, Liangyue Li, Xintao Wu, Nan Cao, Lei Ying, Hanghang Tong

Network alignment finds node correspondences across multiple networks, where the alignment accuracy is of crucial importance because of its profound impact on downstream applications. The vast majority of existing works focus on how to best utilize the topology and attribute information of the input networks as well as the anchor links when available. Nonetheless, with a few exceptions, how to boost the alignment performance by actively obtaining high-quality and informative anchor links has not been well studied. The sparse literature on active network alignment introduces a human in the loop to label some seed node correspondences (i.e., anchor links), which are informative from the perspective of querying the most uncertain node given few potential matchings. However, the direct influence of the intrinsic network attribute information on the alignment results has largely remained unknown. In this paper, we tackle this challenge and propose an active network alignment method (Attent) to identify the best nodes to query. The key idea of the proposed method is to leverage effective and efficient influence functions defined over the alignment solution to evaluate the goodness of the candidate nodes for query. Our proposed query strategy bears three distinct advantages: (1) effectiveness, being able to accurately quantify the influence of the candidate nodes on the alignment results; (2) efficiency, scaling linearly with a 15-17× speed-up over the straightforward implementation without any quality loss; (3) generality, consistently improving the alignment performance of a variety of network alignment algorithms.



Citations: 13

XY-Sketch: on Sketching Data Streams at Web Scale

Proceedings of the Web Conference 2021

Pub Date : 2021-04-19 DOI: 10.1145/3442381.3449984

Yongqiang Liu, Xike Xie

Conventional sketching methods for counting stream item frequencies use hash functions to map data items to a concise structure, e.g., a two-dimensional array, at the expense of overcounting due to hash collisions. Despite their popularity, the errors accumulated from hash collisions degrade sketching accuracy as the data volume grows rapidly, which poses a great challenge for sketching big data streams at web scale. In this paper, we propose a novel structure, called XY-sketch, which estimates the frequency of a data item by estimating the probability of this item appearing in the data stream. The framework associated with XY-sketch consists of two phases, namely the decomposition and recomposition phases. A data item is split into a set of compactly stored basic elements, which can be strung together in a probabilistic manner for query evaluation during the recomposition phase. Throughout, we conduct optimization under space constraints and provide detailed theoretical analysis. Experiments on both real and synthetic datasets show the superior scalability of XY-sketch on sketching large-scale streams. Remarkably, XY-sketch is orders of magnitude more accurate than existing solutions when the space budget is small.
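
The conventional baseline that the abstract contrasts against can be illustrated with a Count-Min sketch, a standard hash-based structure that stores counters in a two-dimensional array and can only overestimate. This is a minimal sketch of that baseline, not of XY-sketch; the row-salted SHA-256 hashing and the default `depth` and `width` values are assumptions chosen for readability.

```python
import hashlib


class CountMinSketch:
    """Minimal Count-Min sketch: depth rows of width counters each."""

    def __init__(self, depth=3, width=16):
        self.depth = depth
        self.width = width
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        # Salt the hash with the row number to get one hash function per row.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        # Collisions only ever add to a counter, so the minimum across rows
        # is the tightest available upper bound on the true frequency.
        return min(
            self.table[row][self._index(row, item)]
            for row in range(self.depth)
        )


if __name__ == "__main__":
    sketch = CountMinSketch(depth=3, width=16)
    for _ in range(5):
        sketch.add("a")
    for i in range(100):
        sketch.add(f"x{i}")
    print("true count of 'a' is 5, estimate is", sketch.estimate("a"))
```

With a small `width` and many distinct items, the estimate for `"a"` will typically exceed 5, which is the overcounting effect the abstract refers to.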



Citations: 4

ColChain: Collaborative Linked Data Networks

Proceedings of the Web Conference 2021

Pub Date : 2021-04-19 DOI: 10.1145/3442381.3450037

Christian Aebeloe, Gabriela Montoya, K. Hose

One of the major obstacles that currently prevents the Semantic Web from exploiting its full potential is that the data it provides access to is sometimes not available or outdated. The reason is rooted deep within its architecture that relies on data providers to keep the data available, queryable, and up-to-date at all times – an expectation that many data providers in reality cannot live up to for an extended (or infinite) period of time. Hence, decentralized architectures have recently been proposed that use replication to keep the data available in case the data provider fails. Although this increases availability, it does not help keeping the data up-to-date or allow users to query and access previous versions of a dataset. In this paper, we therefore propose ColChain (COLlaborative knowledge CHAINs), a novel decentralized architecture based on blockchains that not only lowers the burden for the data providers but at the same time also allows users to propose updates to faulty or outdated data, trace updates back to their origin, and query older versions of the data. Our extensive experiments show that ColChain reaches these goals while achieving query processing performance comparable to the state of the art.
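
The three capabilities named in the abstract (proposing updates, tracing them to their origin, and querying older versions) can be illustrated with a toy hash-chained update log. This is a generic sketch under assumptions, not ColChain's actual protocol or data model: the class `UpdateChain`, its block layout, and the use of SHA-256 over JSON are all hypothetical choices made for the example.

```python
import hashlib
import json


def _hash(body):
    """Deterministic SHA-256 digest of a JSON-serializable block body."""
    encoded = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


class UpdateChain:
    """Append-only chain of updates to a set of (s, p, o) triples."""

    GENESIS = "0" * 64

    def __init__(self):
        self.blocks = []

    def append(self, author, additions=(), deletions=()):
        prev = self.blocks[-1]["hash"] if self.blocks else self.GENESIS
        body = {
            "prev": prev,
            "author": author,  # recorded so updates can be traced
            "add": sorted(additions),
            "del": sorted(deletions),
        }
        self.blocks.append({**body, "hash": _hash(body)})

    def snapshot(self, version=None):
        """Replay the first `version` blocks (all blocks if None)."""
        blocks = self.blocks if version is None else self.blocks[:version]
        triples = set()
        for block in blocks:
            triples -= {tuple(t) for t in block["del"]}
            triples |= {tuple(t) for t in block["add"]}
        return triples

    def verify(self):
        """Check that every block links to its predecessor and is unmodified."""
        prev = self.GENESIS
        for block in self.blocks:
            body = {key: block[key] for key in ("prev", "author", "add", "del")}
            if block["prev"] != prev or _hash(body) != block["hash"]:
                return False
            prev = block["hash"]
        return True


if __name__ == "__main__":
    chain = UpdateChain()
    chain.append("alice", additions=[("s", "p", "o1")])
    chain.append("bob", additions=[("s", "p", "o2")], deletions=[("s", "p", "o1")])
    print(chain.snapshot(1), chain.snapshot(), chain.verify())
```

Replaying a prefix of the chain gives an older version of the data, and the stored author plus the hash links give a tamper-evident trail for each update.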



Citations: 14

WebSocket Adoption and the Landscape of the Real-Time Web

Proceedings of the Web Conference 2021

Pub Date : 2021-04-19 DOI: 10.1145/3442381.3450063

Paul Murley, Zane Ma, Joshua Mason, Michael Bailey, Amin Kharraz

Developers are increasingly deploying web applications which require real-time bidirectional updates, a use case which does not naturally align with the traditional client-server architecture of the web. Many solutions have arisen to address this need over the preceding decades, including HTTP polling, Server-Sent Events, and WebSockets. This paper investigates this ecosystem and reports on the prevalence, benefits, and drawbacks of these technologies, with a particular focus on the adoption of WebSockets. We crawl the Tranco Top 1 Million websites to build a dataset for studying real-time updates in the wild. We find that HTTP Polling remains significantly more common than WebSockets, and WebSocket adoption appears to have stagnated in the past two to three years. We investigate some of the possible reasons for this decrease in the rate of adoption, and we contrast the adoption process to that of other web technologies. Our findings further suggest that even when WebSockets are employed, the prescribed best practices for securing them are often disregarded. The dataset is made available in the hopes that it may help inform the development of future real-time solutions for the web.



Citations: 5

Temporal Analysis of the Entire Ethereum Blockchain Network

Proceedings of the Web Conference 2021

Pub Date : 2021-04-19 DOI: 10.1145/3442381.3449916

Lin Zhao, Sourav Sengupta, Arijit Khan, Robby Luo

With over 42 billion USD market capitalization (October 2020), Ethereum is the largest public blockchain that supports smart contracts. Recent works have modeled transactions, tokens, and other interactions in the Ethereum blockchain as static graphs to provide new observations and insights by conducting relevant graph analysis. Surprisingly, there is much less study on the evolution and temporal properties of these networks. In this paper, we investigate the evolutionary nature of Ethereum interaction networks from a temporal graphs perspective. We study the growth rate and model of four Ethereum blockchain networks, active lifespan and update rate of high-degree vertices. We detect anomalies based on temporal changes in global network properties, and forecast the survival of network communities in succeeding months leveraging on the relevant graph features and machine learning models.
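
Treating an interaction network as a temporal graph, as opposed to a single static graph, can be illustrated by tracking how the cumulative number of vertices and edges grows month by month. The sketch below assumes a simplified input of `(month, source, target)` tuples with months written as `YYYY-MM`; the function name and the data are hypothetical and do not come from the paper.

```python
def cumulative_growth(edges):
    """Return [(month, num_vertices, num_edges), ...] in chronological order.

    edges: iterable of (month, source, target), month formatted as 'YYYY-MM'
    so that lexicographic order equals chronological order.
    Counts are cumulative and repeated edges are counted once.
    """
    by_month = {}
    for month, source, target in edges:
        by_month.setdefault(month, []).append((source, target))

    vertices = set()
    seen_edges = set()
    growth = []
    for month in sorted(by_month):
        for source, target in by_month[month]:
            vertices.update((source, target))
            seen_edges.add((source, target))
        growth.append((month, len(vertices), len(seen_edges)))
    return growth


if __name__ == "__main__":
    interactions = [
        ("2020-01", "a", "b"),
        ("2020-01", "b", "c"),
        ("2020-02", "a", "b"),
        ("2020-02", "c", "d"),
    ]
    for row in cumulative_growth(interactions):
        print(row)
```

A series like this is the raw material for fitting a growth model or for flagging months where a global property changes abruptly.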


Citations: 34