
Latest publications from the 2020 IEEE 36th International Conference on Data Engineering (ICDE)

Speed Kit: A Polyglot & GDPR-Compliant Approach For Caching Personalized Content
Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00142
Wolfram Wingerath, Felix Gessert, Erik Witt, Hannes Kuhlmann, Florian Bücklers, Benjamin Wollmer, N. Ritter
Users leave when page loads take too long. This simple fact has complex implications for virtually all modern businesses, because accelerating content delivery through caching is not as simple as it used to be. As a fundamental technical challenge, the high degree of personalization in today’s Web has seemingly outgrown the capabilities of traditional content delivery networks (CDNs), which were designed for distributing static assets under fixed caching times. As an additional legal challenge for services with personalized content, an increasing number of regional data protection laws constrain the ways in which CDNs can be used in the first place. In this paper, we present Speed Kit as a radically different approach for content distribution that combines (1) a polyglot architecture for efficiently caching personalized content with (2) a natively GDPR-compliant client proxy that handles all sensitive information within the user device. We describe the system design and implementation, explain the custom cache coherence protocol that avoids data staleness and achieves Δ-atomicity, and share field experiences from over a year of production use in the e-commerce industry.
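To make the Δ-atomicity notion concrete, here is a minimal sketch of a client-side cache that serves an entry only while its staleness is bounded by Δ and revalidates otherwise. The class name and revalidation strategy are hypothetical illustrations, not Speed Kit's actual coherence protocol, which the paper describes in full.

```python
import time

class DeltaBoundedCache:
    """Toy client-side cache: serve an entry only while its staleness is
    bounded by delta; otherwise revalidate against the origin
    (a hypothetical sketch, not Speed Kit's actual protocol)."""

    def __init__(self, delta_seconds, fetch_fn):
        self.delta = delta_seconds
        self.fetch = fetch_fn        # origin lookup, e.g. an HTTP GET
        self.entries = {}            # url -> (value, cached_at)

    def get(self, url):
        hit = self.entries.get(url)
        if hit is not None:
            value, cached_at = hit
            if time.time() - cached_at <= self.delta:
                return value         # age <= delta: a Δ-atomic read
        value = self.fetch(url)      # stale or missing: revalidate
        self.entries[url] = (value, time.time())
        return value
```

Under this discipline every returned value is at most Δ seconds staler than the origin, which is the guarantee the abstract refers to; keeping the cache inside the user device is what makes the personalized entries GDPR-compatible.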
Citations: 14
Automatic View Generation with Deep Learning and Reinforcement Learning
Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00133
Haitao Yuan, Guoliang Li, Ling Feng, Ji Sun, Yue Han
Materializing views is an important method for reducing redundant computation in a DBMS, especially when processing large-scale analytical queries. However, many existing methods still need DBAs to manually generate materialized views, which does not scale to a large number of database instances, especially on cloud databases. To address this problem, we propose an automatic view generation method which judiciously selects "highly beneficial" subqueries to generate materialized views. However, there are two challenges. (1) How to estimate the benefit of using a materialized view for a query? (2) How to select optimal subqueries to generate materialized views? To address the first challenge, we propose a neural-network-based method to estimate the benefit of using a materialized view to answer a query. In particular, we extract significant features from different perspectives and design effective encoding models to transform these features into hidden representations. To address the second challenge, we model this problem as an ILP (Integer Linear Programming) problem, which aims to maximize the utility by selecting optimal subqueries to materialize. We design an iterative optimization method to select subqueries to materialize. However, this method cannot guarantee the convergence of the solution. To address this issue, we model the iterative optimization process as an MDP (Markov Decision Process) and use a deep reinforcement learning model to solve the problem. Extensive experiments show that our method outperforms existing solutions by 28.4%, 8.8% and 31.7% on three real-world datasets.
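For a flavor of the subquery-selection step, the following sketch greedily materializes the candidate subqueries with the best estimated benefit per unit of storage under a budget. The candidate format and budget constraint are assumptions for illustration; the paper instead solves an ILP and refines it with deep reinforcement learning.

```python
def select_views_greedy(candidates, budget):
    """Greedily pick subqueries to materialize.

    candidates: list of (subquery_id, est_benefit, size), where est_benefit
    would come from a learned estimator like the paper's; here both numbers
    are simply given. A stand-in for the ILP / RL formulation, not it."""
    chosen, used = [], 0
    # highest benefit per byte of storage first
    for sq, benefit, size in sorted(candidates,
                                    key=lambda c: c[1] / c[2],
                                    reverse=True):
        if used + size <= budget and benefit > 0:
            chosen.append(sq)
            used += size
    return chosen
```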
Citations: 44
BFT-Store: Storage Partition for Permissioned Blockchain via Erasure Coding
Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00205
Xiaodong Qi, Zhao Zhang, Cheqing Jin, Aoying Zhou
The full-replication data storage mechanism commonly utilized in existing blockchain systems lacks storage scalability, since it reserves a copy of the whole block data in each node, so the overall storage consumption per block is O(n) with n nodes. Moreover, due to the existence of Byzantine nodes, existing partitioning methods, though widely adopted in distributed systems for decades, cannot be applied directly to blockchain systems, so it is critical to devise a new storage mechanism. This paper proposes a novel storage engine, called BFT-Store, which enhances storage scalability by integrating erasure coding with a Byzantine Fault Tolerance (BFT) consensus protocol. First, the storage consumption per block can be reduced to O(1), which enlarges overall storage capability as more nodes join the blockchain. Second, an efficient online re-encoding protocol is designed for storage scale-out, and a hybrid replication scheme is employed to improve read performance. Finally, extensive experimental results illustrate the scalability, availability and efficiency of BFT-Store, which is implemented on the open-source permissioned blockchain Tendermint.
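The storage argument can be made concrete with a little arithmetic. The sketch below compares cluster-wide bytes per block under full replication and under an erasure code with one shard per node; the parameter choice k = n − 2f is an illustrative assumption, not necessarily BFT-Store's exact configuration.

```python
def storage_per_block(n, f, block_size):
    """Cluster-wide bytes stored for one block: full replication vs. an
    (k, m) erasure code sized for n nodes tolerating f Byzantine faults.
    Assumes a Reed-Solomon style code with k = n - 2f data shards and
    m = 2f parity shards, one shard per node (an illustrative choice)."""
    k = n - 2 * f
    full_replication = n * block_size      # every node keeps a copy: O(n)
    erasure_coded = n * (block_size / k)   # one shard per node: ~O(1) copies
    return full_replication, erasure_coded

# e.g. 16 nodes, 2 faults, 1 MiB block: full replication stores 16 MiB,
# erasure coding stores about 1.33 MiB cluster-wide
print(storage_per_block(16, 2, 1 << 20))
```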
Citations: 26
Online Trichromatic Pickup and Delivery Scheduling in Spatial Crowdsourcing
Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00089
Bolong Zheng, Chenze Huang, Christian S. Jensen, Lu Chen, Nguyen Quoc Viet Hung, Guanfeng Liu, Guohui Li, Kai Zheng
In Pickup-and-Delivery problems (PDP), mobile workers are employed to pick up and deliver items with the goal of reducing travel and fuel consumption. Unlike most existing efforts that focus on finding a schedule that enables the delivery of as many items as possible at the lowest cost, we consider a trichromatic (worker-item-task) utility that encompasses worker reliability, item quality, and task profitability. Moreover, we allow customers to specify keywords for desired items when they submit tasks, which may result in multiple pickup options and thus further increases the difficulty of the problem. Specifically, we formulate the problem of Online Trichromatic Pickup and Delivery Scheduling (OTPD), which aims to find optimal delivery schedules with the highest overall utility. In order to respond quickly to submitted tasks, we propose a greedy solution that finds the schedule with the highest utility-to-cost ratio. Next, we introduce a skyline kinetic tree-based solution that materializes intermediate results to improve result quality. Finally, we propose a density-based grouping solution that partitions streaming tasks and efficiently assigns them to workers with high overall utility. Extensive experiments with real and synthetic data offer evidence that the proposed solutions excel over baselines with respect to both effectiveness and efficiency.
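Here is a minimal sketch of the greedy step, assuming caller-supplied utility and cost functions over (worker, task) pairs and at most one task per worker; actual OTPD schedules are richer, covering keyword-matched pickup options and routes.

```python
def greedy_schedule(tasks, workers, utility, cost):
    """Repeatedly assign the (worker, task) pair with the best
    utility-to-cost ratio. utility/cost are caller-supplied; the paper's
    trichromatic utility folds in worker reliability, item quality and
    task profitability."""
    assignments = []
    free, pending = set(workers), set(tasks)
    while free and pending:
        w, t = max(((w, t) for w in free for t in pending),
                   key=lambda pair: utility(*pair) / cost(*pair))
        assignments.append((w, t))
        free.remove(w)
        pending.remove(t)
    return assignments
```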
Citations: 15
Design of Database Systems with DRAM-only Heterogeneous Memory Architecture
Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00243
Yifan Qiao
This thesis is the first to advocate a DRAM-only strategy for reducing computing-system memory cost, and investigates its applications to database systems. The thesis envisions a low-cost DRAM module called block-protected DRAM, which reduces bit cost by significantly relaxing DRAM raw reliability while employing a long error correction code (ECC) to ensure data integrity at small coding redundancy. Built upon exactly the same DRAM technology, today’s byte-accessible DRAM and the envisioned block-protected DRAM strike different trade-offs between memory bit cost and native data-access granularity, and naturally form a heterogeneous DRAM-only memory system. The practical feasibility of such heterogeneous memory systems is further strengthened by new media-agnostic and latency-oblivious CPU-memory interfaces such as IBM’s OpenCAPI/OMI and Intel’s CXL. This DRAM-only design approach leverages the existing DRAM manufacturing infrastructure and is not subject to any fundamental technology risk or uncertainty. Hence, before NVM technologies can eventually fulfill their long-awaited promise (i.e., DRAM-grade speed at flash-grade cost), this DRAM-only design framework can fill the gap and empower continuous progress in computing systems. This thesis aims to develop techniques that enable relational and NoSQL databases to take full advantage of the envisioned low-cost heterogeneous DRAM system. As a first step, we studied how one could employ heterogeneous DRAM to implement a low-cost tiered caching solution for relational databases, and obtained encouraging results using MySQL as a test vehicle.
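As a rough sketch of the tiered-caching idea, the toy buffer below keeps hot pages in a small byte-addressable tier and demotes colder pages to a larger block-granular tier before falling back to disk. Capacities, eviction policy, and names are all illustrative assumptions, not the thesis's MySQL implementation.

```python
from collections import OrderedDict

class TieredBuffer:
    """Toy two-tier page cache: a small byte-addressable tier in front of
    a larger block-granular tier (parameters are illustrative)."""

    def __init__(self, fast_cap, slow_cap, read_page_from_disk):
        self.fast = OrderedDict()    # hot pages, LRU order
        self.slow = OrderedDict()    # colder pages: block-protected tier
        self.fast_cap, self.slow_cap = fast_cap, slow_cap
        self.read_disk = read_page_from_disk

    def get(self, page_id):
        if page_id in self.fast:                 # byte-addressable hit
            self.fast.move_to_end(page_id)
            return self.fast[page_id]
        page = self.slow.pop(page_id, None)      # whole-block read + ECC
        if page is None:
            page = self.read_disk(page_id)
        self._admit(page_id, page)
        return page

    def _admit(self, page_id, page):
        self.fast[page_id] = page
        if len(self.fast) > self.fast_cap:       # demote the LRU page
            victim, data = self.fast.popitem(last=False)
            self.slow[victim] = data
            if len(self.slow) > self.slow_cap:
                self.slow.popitem(last=False)    # disk still has a copy
```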
Citations: 2
Efficient Structural Clustering in Large Uncertain Graphs
Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00215
Yongjiang Liang, Tingting Hu, Peixiang Zhao
Clustering uncertain graphs based on the probabilistic graph model has sparked extensive research and widely varying applications. Existing structural clustering methods rely heavily on the computation of pairwise reliable structural similarity between vertices, which has proven to be extremely costly, especially in large uncertain graphs. In this paper, we develop a new, decomposition-based method, ProbSCAN, for efficient reliable structural similarity computation with theoretically improved complexity. We further design a cost-effective index structure, UCNO-Index, and a series of powerful pruning strategies to expedite reliable structural similarity computation in uncertain graphs. Experimental studies on eight real-world uncertain graphs demonstrate the effectiveness of our proposed solutions, which achieve orders-of-magnitude improvements in clustering efficiency over the state-of-the-art structural clustering methods on large uncertain graphs.
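For intuition, the reliable structural similarity of a vertex pair can be read as the probability, over possible worlds of the uncertain graph, that their SCAN-style similarity reaches a threshold ε. The Monte-Carlo estimator below is only a reference point under that reading; ProbSCAN computes the quantity exactly via decomposition.

```python
import random
from math import sqrt

def reliable_similarity_mc(edges, u, v, eps, samples=1000):
    """Estimate P(sim(u, v) >= eps) over possible worlds, where sim is the
    SCAN similarity |N[u] & N[v]| / sqrt(|N[u]| * |N[v]|) on closed
    neighborhoods. edges: {(a, b): existence probability}. A sampling
    sketch, not ProbSCAN's exact decomposition-based computation."""
    hits = 0
    for _ in range(samples):
        adj = {}                        # sample one possible world
        for (a, b), p in edges.items():
            if random.random() < p:
                adj.setdefault(a, set()).add(b)
                adj.setdefault(b, set()).add(a)
        nu = adj.get(u, set()) | {u}    # closed neighborhoods
        nv = adj.get(v, set()) | {v}
        sim = len(nu & nv) / sqrt(len(nu) * len(nv))
        hits += sim >= eps
    return hits / samples
```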
Citations: 7
Crowdsourcing-based Data Extraction from Visualization Charts
Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00177
Chengliang Chai, Guoliang Li, Ju Fan, Yuyu Luo
Visualization charts are widely utilized for presenting structured data. Under many circumstances, people want to explore the data in charts collected from various sources, such as papers and websites, so as to further analyze the data or create new charts. However, the existing automatic and semi-automatic approaches are not always effective due to the variety of charts. In this paper, we introduce a crowdsourcing approach that leverages human ability to extract data from visualization charts. There are several challenges. The first is how to avoid tedious human interaction with charts and design simple crowdsourcing tasks. Second, it is challenging to evaluate workers’ quality for truth inference, because workers may not only provide inaccurate values but also misalign values to the wrong data series. To address these challenges, we design an effective crowdsourcing task scheme that splits a chart into simple micro-tasks. We introduce a novel worker quality model that considers both worker accuracy and task difficulty. We also devise an effective early-stopping mechanism to save cost. We have conducted experiments on a real crowdsourcing platform, and the results show that our framework outperforms state-of-the-art approaches in both cost and quality.
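To illustrate quality-aware truth inference, here is a generic iterative aggregator for numeric chart readings: truths are re-estimated as quality-weighted means, and worker quality is re-scored from deviations. It is a simple stand-in, not the paper's exact model, which additionally accounts for task difficulty and series misalignment.

```python
def infer_values(answers, rounds=10):
    """Iterative truth inference for numeric readings.

    answers: {task_id: [(worker_id, value), ...]}. Worker quality starts
    uniform; each round re-estimates truths as quality-weighted means,
    then re-scores workers by mean deviation from those truths."""
    workers = {w for ans in answers.values() for w, _ in ans}
    quality = {w: 1.0 for w in workers}
    truth = {}
    for _ in range(rounds):
        for t, ans in answers.items():          # truths: weighted means
            total = sum(quality[w] for w, _ in ans)
            truth[t] = sum(quality[w] * v for w, v in ans) / total
        errors = {w: [] for w in workers}       # re-score each worker
        for t, ans in answers.items():
            for w, v in ans:
                errors[w].append(abs(v - truth[t]))
        for w, errs in errors.items():
            quality[w] = 1.0 / (1e-6 + sum(errs) / len(errs))
    return truth, quality
```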
Citations: 7
FlashSchema: Achieving High Quality XML Schemas with Powerful Inference Algorithms and Large-scale Schema Data
Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00214
Yeting Li, Jialun Cao, H. Chen, Tingjian Ge, Zhiwu Xu, Qiancheng Peng
Obtaining high-quality XML schemas to avoid or reduce application risks is an important problem in practice, and several important aspects of it have yet to be addressed satisfactorily in existing work. In this paper, we propose a tool, FlashSchema, for high-quality XML schema design, which supports both one-pass and interactive schema design as well as schema recommendation. To the best of our knowledge, no other existing tools support interactive schema design and schema recommendation. One salient feature of our work is the design of algorithms to infer k-occurrence interleaving regular expressions, which are not only more powerful in model capacity but also more efficient. These algorithms form the basis of our interactive schema design. The other feature is that, starting from large-scale schema data harvested from the Web, we devise a new solution for type inference and propose schema recommendation for schema design. Finally, we conduct a series of experiments on two XML datasets, comparing with nine state-of-the-art algorithms and open-source tools in terms of running time, preciseness, and conciseness. Experimental results show that our work achieves the highest level of preciseness and conciseness within only a few seconds. Experimental results and examples also demonstrate the effectiveness of our type inference and schema recommendation methods.
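As a toy ingredient of such inference, the sketch below derives per-element occurrence bounds from sample child-element sequences, from which quantifiers such as ?, *, + or a k-occurrence bound could be chosen. Inferring the interleaving structure itself, which the paper's algorithms do, is the hard part and is omitted here.

```python
def infer_counts(samples):
    """From sample child-element sequences, derive per-element occurrence
    bounds: name -> (min_occurs, max_occurs) across all samples. A toy
    step only, not FlashSchema's full inference algorithm."""
    bounds = {}
    for seq in samples:
        seen = {}
        for name in seq:                        # count per sample
            seen[name] = seen.get(name, 0) + 1
        for name in set(seen) | set(bounds):    # widen the bounds
            lo, hi = bounds.get(name, (float("inf"), 0))
            c = seen.get(name, 0)
            bounds[name] = (min(lo, c), max(hi, c))
    return bounds

# infer_counts([["a", "b", "a"], ["b"]]) -> {"a": (0, 2), "b": (1, 1)},
# suggesting a{0,2} (k-occurrence) and a mandatory single b
```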
Citations: 2
Being Happy with the Least: Achieving α-happiness with Minimum Number of Tuples
Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00092
Min Xie, R. C. Wong, Peng Peng, V. Tsotras
When faced with a database containing millions of products, a user may be only interested in a (typically much) smaller representative subset. Various approaches were proposed to create a good representative subset that fits the user’s needs, which are expressed in the form of a utility function (e.g., the top-k and diversification queries). Recently, a regret minimization query was proposed: it does not require users to provide their utility functions and returns a small set of tuples such that any user’s favorite tuple in this subset is guaranteed to be not much worse than his/her favorite tuple in the whole database. In a sense, this query finds a small set of tuples that makes the user happy (i.e., not regretful) even if s/he gets the best tuple in the selected set but not the best tuple among all tuples in the database. In this paper, we study the min-size version of the regret minimization query; that is, we want to determine the fewest tuples needed to keep users happy at a given level. We term this problem the α-happiness query, where we quantify the user’s happiness level by a criterion called the happiness ratio, and guarantee that each user is at least α happy with the set returned (i.e., the happiness ratio is at least α), where α is a real number from 0 to 1. As this is an NP-hard problem, we derive an approximate solution with a theoretical guarantee by considering the problem from a geometric perspective. Since in practical scenarios users are interested in achieving higher happiness levels (i.e., α closer to 1), we performed extensive experiments for these scenarios, using both real and synthetic datasets. Our evaluations show that our algorithm outperforms the best-known previous approaches in two ways: (i) it answers the α-happiness query by returning fewer tuples to users, and (ii) it answers much faster (up to two orders of magnitude faster for large α).
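Here is a worked sketch of the α-happiness notion, assuming non-negative attribute values and approximating the space of linear utility functions by random samples: the happiness ratio of a subset is the worst-case fraction of the best achievable utility it preserves, and tuples are added greedily until every sampled user is α-happy. The paper's algorithm instead uses a geometric construction with a proven guarantee.

```python
import random

def happiness_ratio(subset, db, utils):
    """Min over sampled linear utilities of (best utility in subset) /
    (best utility in the whole database). Assumes non-negative values."""
    return min(
        max(sum(w * x for w, x in zip(u, t)) for t in subset) /
        max(sum(w * x for w, x in zip(u, t)) for t in db)
        for u in utils
    )

def alpha_happy_subset(db, alpha, n_utils=500):
    """Greedy sketch of the alpha-happiness query: keep adding the tuple
    that most improves the ratio until it reaches alpha. Random sampling
    of utility functions is illustrative, not the paper's method."""
    dims = len(db[0])
    utils = [[random.random() for _ in range(dims)] for _ in range(n_utils)]
    subset = []
    while not subset or happiness_ratio(subset, db, utils) < alpha:
        best = max((t for t in db if t not in subset),
                   key=lambda t: happiness_ratio(subset + [t], db, utils))
        subset.append(best)
    return subset
```

The loop always terminates: once the subset equals the database, the ratio is 1, which is at least α for any α in [0, 1].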
Citations: 15
Online Indices for Predictive Top-k Entity and Aggregate Queries on Knowledge Graphs
Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00096
Yan Li, Tingjian Ge, Cindy X. Chen
Knowledge graphs have seen increasingly broad applications. However, they are known to be incomplete. We define the notion of a virtual knowledge graph, which extends a knowledge graph with predicted edges and their probabilities. We focus on two important types of queries: top-k entity queries and aggregate queries. To improve query processing efficiency, we propose an incremental index on top of low-dimensional entity vectors transformed from network embedding vectors. We also devise query processing algorithms that use the index. Moreover, we provide theoretical guarantees of accuracy and conduct a systematic experimental evaluation. The experiments show that our approach is very efficient and effective. In particular, with the same or better accuracy guarantees, it is one to two orders of magnitude faster in query processing than the closest previous work, which can only handle one relationship type.
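For concreteness, a top-k entity query on a virtual knowledge graph could rank candidate tails of a predicted edge by an embedding score, as in the TransE-style sketch below; the brute-force scan stands in for the paper's incremental index over transformed low-dimensional vectors.

```python
import numpy as np

def topk_predicted_tails(head_vec, rel_vec, entity_mat, entity_ids, k=10):
    """Rank candidate tail entities for (head, relation, ?) by the
    TransE-style score -||h + r - t||. entity_mat: (N, d) embedding
    matrix; entity_ids: the N entity labels. A brute-force sketch, not
    the paper's indexed query processing."""
    scores = -np.linalg.norm(entity_mat - (head_vec + rel_vec), axis=1)
    top = np.argsort(-scores)[:k]               # k best-scoring entities
    return [(entity_ids[i], float(scores[i])) for i in top]
```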
Citations: 5