Latest publications from the 2020 IEEE 36th International Conference on Data Engineering (ICDE)
SuRF: Identification of Interesting Data Regions with Surrogate Models
Pub Date: 2020-04-01 DOI: 10.1109/ICDE48307.2020.00118
Fotis Savva, C. Anagnostopoulos, P. Triantafillou
Several data mining tasks focus on repeatedly inspecting multidimensional data regions summarized by a statistic. The value of this statistic (e.g., region-population sizes, order moments) is used to classify a region’s interestingness. These regions can be naively extracted from the entire dataspace – however, this is extremely time-consuming and demanding of compute resources. This paper studies the reverse problem: analysts provide a cut-off value for a statistic of interest, and in turn our proposed framework efficiently identifies multidimensional regions whose statistic exceeds (or falls below) the given cut-off value, according to the user’s needs. However, as data dimensionality and size increase, such a task inevitably becomes laborious and costly. To alleviate this cost, our solution, coined SuRF (SUrrogate Region Finder), leverages historical region evaluations to train surrogate models that learn to approximate the distribution of the statistic of interest. It then uses evolutionary multi-modal optimization to effectively and efficiently identify regions of interest regardless of data size and dimensionality. The accuracy, efficiency, and scalability of our approach are demonstrated through experiments on synthetic and real-world datasets and comparisons with other methods.
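As an illustration of the surrogate idea, the sketch below replaces an expensive region statistic (a region's population count) with a cheap model and screens candidate regions against a cutoff. Everything here is a toy assumption of mine — SuRF trains surrogates on historical region evaluations and searches with evolutionary optimization, whereas this sketch simply fits a coarse histogram and scans a fixed candidate list:

```python
import random

def region_count(data, region):
    """Exact (expensive) statistic: number of points inside an axis-aligned region."""
    (x0, x1), (y0, y1) = region
    return sum(1 for x, y in data if x0 <= x <= x1 and y0 <= y <= y1)

def fit_density_surrogate(data, grid=10):
    """Fit a trivial surrogate: a coarse 2D histogram over the unit square."""
    hist = [[0] * grid for _ in range(grid)]
    for x, y in data:
        hist[min(int(x * grid), grid - 1)][min(int(y * grid), grid - 1)] += 1
    return hist

def surrogate_count(hist, region, grid=10):
    """Cheap approximation: sum the histogram cells whose centers fall in the region."""
    (x0, x1), (y0, y1) = region
    total = 0
    for i in range(grid):
        for j in range(grid):
            cx, cy = (i + 0.5) / grid, (j + 0.5) / grid
            if x0 <= cx <= x1 and y0 <= cy <= y1:
                total += hist[i][j]
    return total

random.seed(0)
data = [(random.random(), random.random()) for _ in range(10000)]
hist = fit_density_surrogate(data)
# Screen candidates: keep regions whose surrogate statistic exceeds the cutoff.
candidates = [((0.0, 0.5), (0.0, 0.5)), ((0.9, 1.0), (0.9, 1.0))]
interesting = [r for r in candidates if surrogate_count(hist, r) > 1000]
```

The point of the surrogate is that `surrogate_count` never touches the raw data, so screening many candidate regions costs far less than evaluating `region_count` for each.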
Pages: 1321-1332
Citations: 2
VAC: Vertex-Centric Attributed Community Search
Pub Date: 2020-04-01 DOI: 10.1109/ICDE48307.2020.00086
Qing Liu, Yifan Zhu, Minjun Zhao, Xin Huang, Jianliang Xu, Yunjun Gao
Attributed community search aims to find communities with strong structural and attribute cohesiveness in attributed graphs. However, existing works suffer from two major limitations: (i) it is not easy to set conditions on query attributes; (ii) queries support only a single type of attribute. To make up for these deficiencies, in this paper we study a novel attributed community search called vertex-centric attributed community (VAC) search. Given an attributed graph and a query vertex set, the VAC search returns the community that is densely connected (ensured by the k-truss model) and has the best attribute score. We show that the problem is NP-hard. To answer the VAC search, we develop both exact and approximate algorithms. Specifically, we develop two exact algorithms: one searches the community in a depth-first manner, the other in a best-first manner. We also propose a set of heuristic strategies that prune the unqualified search space by exploiting structural and attribute properties. In addition, to further improve search efficiency, we propose a 2-approximation algorithm. Comprehensive experimental studies on various real-world attributed graphs demonstrate the effectiveness of the proposed model and the efficiency of the developed algorithms.
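The k-truss model mentioned in the abstract requires every edge of the community to be supported by at least k−2 triangles. A minimal peeling sketch (function names are mine; the paper's algorithms additionally handle attribute scores and query vertices):

```python
from itertools import combinations

def k_truss(edges, k):
    """Return the edge set of the k-truss: the maximal subgraph in which
    every edge participates in at least k-2 triangles."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    changed = True
    while changed:  # peel until every remaining edge is sufficiently supported
        changed = False
        for u in list(adj):
            for v in list(adj.get(u, ())):
                if u < v:
                    support = len(adj[u] & adj[v])  # common neighbours = triangles on (u, v)
                    if support < k - 2:
                        adj[u].discard(v)
                        adj[v].discard(u)
                        changed = True
    return {(u, v) for u in adj for v in adj[u] if u < v}

# A 4-clique plus a pendant edge: the 4-truss keeps the clique, drops the pendant.
clique = list(combinations([1, 2, 3, 4], 2))
edges = clique + [(4, 5)]
```

Each clique edge sits in two triangles (support 2 ≥ k−2 for k = 4), while the pendant edge (4, 5) has support 0 and is peeled away.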
Pages: 937-948
Citations: 42
Skyline Cohesive Group Queries in Large Road-social Networks
Pub Date: 2020-04-01 DOI: 10.1109/ICDE48307.2020.00041
Qiyan Li, Yuanyuan Zhu, J. Yu
Given a network with social and spatial information, cohesive group queries aim at finding a group of users who are strongly connected and closely co-located. Most existing studies are limited to finding groups either with the strongest social ties under a certain spatial constraint or with minimum spatial distance under certain social constraints. Since social and spatial constraints are different in nature, it is difficult for users to decide which constraints to choose and how to prioritize them to meet their real requirements. In this paper, we take a new approach that considers the constraints equally and study a skyline query. Specifically, given a road-social network consisting of a road network Gr and a location-based social network Gs, we aim to find a set of skyline cohesive groups, in which no group is dominated by any other group in terms of social cohesiveness and spatial cohesiveness. We measure social cohesiveness based on the (k, c)-core (a k-core of size c) and spatial cohesiveness based on the travel cost of group members to a meeting point. This skyline problem is NP-hard, as we need to explore combinations of c vertices to check whether they form a qualified (k, c)-core. In this paper, we first provide exact solutions by developing efficient pruning strategies to filter out the large number of combinations that cannot form a (k, c)-core, and then propose highly efficient greedy solutions based on a newly designed cd-tree that maintains road-network distances and social structural information simultaneously. Experimental results show that our exact methods generally run 2-4 orders of magnitude faster than the brute-force methods, and our cd-tree based greedy methods significantly reduce the computation cost by 1-4 orders of magnitude while incurring less than 5% extra travel cost compared to the exact methods on multiple real road-social networks.
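The dominance relation underlying the skyline can be made concrete. In this hedged sketch, a candidate group is summarized as a (social cohesiveness, travel cost) pair, where higher cohesiveness and lower travel cost are better; the two-number representation is illustrative, not the paper's:

```python
def dominates(a, b):
    """a = (cohesiveness, travel_cost).  a dominates b if a is no worse in
    both dimensions (higher cohesiveness, lower cost) and strictly better
    in at least one."""
    return (a[0] >= b[0] and a[1] <= b[1]) and (a[0] > b[0] or a[1] < b[1])

def skyline(groups):
    """Keep exactly the groups not dominated by any other group."""
    return [g for g in groups
            if not any(dominates(h, g) for h in groups if h is not g)]

groups = [(5, 10.0), (5, 12.0), (3, 4.0), (4, 4.0), (2, 20.0)]
```

Here (5, 10.0) dominates (5, 12.0) and (2, 20.0), and (4, 4.0) dominates (3, 4.0), so only two incomparable groups survive: one socially strongest, one spatially cheapest.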
Pages: 397-408
Citations: 14
An Agile Sample Maintenance Approach for Agile Analytics
Pub Date: 2020-04-01 DOI: 10.1109/ICDE48307.2020.00071
Hanbing Zhang, Yazhong Zhang, Zhenying He, Yinan Jing, Kai Zhang, X. S. Wang
Agile analytics can help organizations to gain and sustain a competitive advantage by making timely decisions. Approximate query processing (AQP) is one of the useful approaches in agile analytics, which facilitates fast queries on big data by leveraging a pre-computed sample. One problem such a sample faces is that when new data is being imported, re-sampling is most likely needed to keep the sample fresh and AQP results accurate enough. Re-sampling from scratch for every batch of new data, called the full re-sampling method and adopted by many existing AQP works, is obviously a very costly process, and a much quicker incremental sampling process, such as reservoir sampling, may be used to cover the newly arrived data. However, incremental update methods suffer from the fact that the sample size cannot be increased, which is a problem when the underlying data distribution dramatically changes and the sample needs to be enlarged to maintain the AQP accuracy. This paper proposes an adaptive sample update (ASU) approach that avoids re-sampling from scratch as much as possible by monitoring the data distribution, and uses instead an incremental update method before a re-sampling becomes necessary. The paper also proposes an enhanced approach (T-ASU), which tries to enlarge the sample size without re-sampling from scratch when a bit of query inaccuracy is tolerable to further reduce the sample update cost. These two approaches are integrated into a state-of-the-art AQP engine for an extensive experimental study. Experimental results on both real-world and synthetic datasets show that the two approaches are faster than the full re-sampling method while achieving almost the same AQP accuracy when the underlying data distribution continuously changes.
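Reservoir sampling, the incremental method named above, maintains a uniform fixed-size sample over a stream (Algorithm R); the fixed sample size is exactly the limitation the paper's adaptive approach works around. A standard sketch:

```python
import random

def reservoir_update(reservoir, size, item, n_seen):
    """Algorithm R: keep a uniform random sample of fixed `size` over a
    stream; `n_seen` is the 1-based position of `item` in the stream."""
    if len(reservoir) < size:
        reservoir.append(item)          # fill phase: keep everything
    else:
        j = random.randrange(n_seen)    # uniform in [0, n_seen)
        if j < size:                    # item replaces a slot with prob size/n_seen
            reservoir[j] = item
    return reservoir

random.seed(42)
reservoir = []
for i, item in enumerate(range(10000), start=1):
    reservoir_update(reservoir, 100, item, i)
```

After processing the whole stream, every element was retained with equal probability 100/10000 — but the reservoir can never grow beyond 100 tuples, which is why a distribution shift forces the full re-sampling the paper seeks to avoid.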
Pages: 757-768
Citations: 0
Improved Correlated Sampling for Join Size Estimation
Pub Date: 2020-04-01 DOI: 10.1109/ICDE48307.2020.00035
Taining Wang, C. Chan
Recent research on sampling-based join size estimation has focused on a promising new technique known as correlated sampling. While several variants of this technique have been proposed, there is a lack of a systematic study of this family of techniques. In this paper, we first introduce a framework to characterize its design space in terms of five parameters. Based on this framework, we propose a new correlated sampling based technique to address the limitations of existing techniques. Our new technique is based on using a discrete learning method for estimating the join size from samples. We experimentally compare the performance of multiple variants of our new technique and identify a hybrid variant that provides the best estimation quality. This hybrid variant not only outperforms the state-of-the-art correlated sampling technique, but it is also more robust to small samples and skewed data.
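For background, the basic correlated-sampling estimator on which such variants build samples both relations with a shared hash on the join key, so both tuples of a joining pair survive together with probability p, and the sampled join size is scaled by 1/p rather than 1/p². A hedged sketch with names of my own choosing:

```python
import hashlib

def keep(key, p):
    """Shared pseudo-random hash: keep a tuple iff h(key) < p.  Because both
    relations use the same hash, a joining pair survives with prob. p."""
    h = int(hashlib.md5(str(key).encode()).hexdigest(), 16)
    return (h % 10**6) / 10**6 < p

def correlated_join_estimate(r_keys, s_keys, p):
    """Estimate |R join S| on the join key from correlated samples."""
    r_sample = [k for k in r_keys if keep(k, p)]
    s_counts = {}
    for k in s_keys:
        if keep(k, p):
            s_counts[k] = s_counts.get(k, 0) + 1
    sample_join = sum(s_counts.get(k, 0) for k in r_sample)
    return sample_join / p  # scale by 1/p, not 1/p**2

r = [1, 2, 2, 3] * 50
s = [2, 3, 3, 4] * 50
```

With p = 1 the estimate is exact (here the true join size is 100·50 + 50·100 = 10000); shrinking p trades accuracy for sample size, which is precisely where the robustness issues the paper targets appear.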
Pages: 325-336
Citations: 10
JUST: JD Urban Spatio-Temporal Data Engine
Pub Date: 2020-04-01 DOI: 10.1109/ICDE48307.2020.00138
Ruiyuan Li, Huajun He, Rubin Wang, Yuchuan Huang, Junwen Liu, Sijie Ruan, Tianfu He, Jie Bao, Yu Zheng
With the prevalence of positioning techniques, a prodigious amount of spatio-temporal data is generated constantly. To effectively support sophisticated urban applications based on spatio-temporal data, e.g., location-based services, an efficient, scalable, update-enabled, and easy-to-use spatio-temporal data management system is desirable. This paper presents JUST, the JD Urban Spatio-Temporal data engine, which can efficiently manage big spatio-temporal data in a convenient way. JUST incorporates the distributed NoSQL data store Apache HBase as the underlying storage, GeoMesa as the spatio-temporal data indexing tool, and Apache Spark as the execution engine. We design two novel indexing techniques, Z2T and XZ2T, which accelerate spatio-temporal queries tremendously. Furthermore, we introduce a compression mechanism that not only greatly reduces the storage cost, but also improves query efficiency. To make JUST easy to use, we design and implement a complete SQL engine, with which all operations can be performed through a SQL-like query language, JustQL. JUST also inherently supports new data insertions and historical data updates without index reconstruction. JUST is deployed as a PaaS in JD with multi-user support, and many applications have been developed based on the SDKs it provides. Extensive experiments are carried out against six state-of-the-art distributed spatio-temporal data management systems on two real datasets and one synthetic dataset. The results show that JUST has competitive query performance and is much more scalable than its competitors.
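The Z2T/XZ2T details are specific to the paper, but indexes of this family build on Z-order bit interleaving, which maps a 2-D grid cell to a one-dimensional key so that spatially close cells tend to share key prefixes; a coarse time bucket can be prepended so that time-range scans hit contiguous key ranges. A generic sketch (the time-prefix layout is my assumption, not necessarily JUST's exact scheme):

```python
def z_order(x, y, bits=16):
    """Interleave the bits of grid coordinates x and y into a Z-order key
    (x bits land in even positions, y bits in odd positions)."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

def st_key(t_bucket, x, y, bits=16):
    """Time-prefixed spatial key: coarse time bucket in the high bits,
    Z-order value in the low bits, so rows in one time bucket are stored
    contiguously and ordered along the space-filling curve."""
    return (t_bucket << (2 * bits)) | z_order(x, y, bits)
```

A range query then decomposes into a small set of contiguous key ranges on the underlying ordered store (HBase in JUST's case), instead of a full scan.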
Pages: 1558-1569
Citations: 42
Efficient Bidirectional Order Dependency Discovery
Pub Date: 2020-04-01 DOI: 10.1109/ICDE48307.2020.00013
Yifeng Jin, Lin Zhu, Zijing Tan
Bidirectional order dependencies state relationships of order between lists of attributes. They naturally model the order-by clauses in SQL queries, and are proved effective in query optimizations concerning sorting. Despite their importance, order dependencies on a dataset are typically unknown and are too costly, if not impossible, to design or discover manually. Techniques for automatic order dependency discovery are recently studied. It is challenging for order dependency discovery to scale well, since it is by nature factorial in the number m of attributes and quadratic in the number n of tuples. In this paper, we adopt a strategy that decouples the impact of m from that of n, and that still finds all minimal valid bidirectional order dependencies. We present carefully designed data structures, a host of algorithms and optimizations, for efficient order dependency discovery. With extensive experimental studies on both real-life and synthetic datasets, we verify our approach significantly outperforms state-of-the-art techniques, by orders of magnitude.
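A bidirectional order dependency such as "ORDER BY price asc implies ORDER BY tax asc" is valid on an instance iff sorting by the left-hand attribute list leaves the right-hand list in its stated order. A minimal validity check (names and the dict-based table encoding are mine; numeric attributes are assumed so that 'desc' can be encoded by negation):

```python
def od_holds(rows, lhs, rhs):
    """Check the order dependency  ORDER BY lhs  =>  ORDER BY rhs  on a
    table given as a list of dicts.  lhs and rhs are lists of
    (attribute, 'asc'|'desc') pairs, i.e. *bidirectional* ODs."""
    def key(row, spec):
        # 'desc' is encoded by negating the (assumed numeric) value.
        return tuple((row[a] if d == 'asc' else -row[a]) for a, d in spec)
    ordered = sorted(rows, key=lambda r: key(r, lhs))
    rhs_keys = [key(r, rhs) for r in ordered]
    return all(a <= b for a, b in zip(rhs_keys, rhs_keys[1:]))

rows = [{'price': 10, 'tax': 1},
        {'price': 20, 'tax': 2},
        {'price': 30, 'tax': 3}]
```

Discovery is the hard part the paper addresses: validating one candidate is cheap, but the candidate space is factorial in the number of attributes, since every ordered attribute list with per-attribute directions is a potential left- or right-hand side.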
Pages: 61-72
Citations: 11
Reasoning about the Future in Blockchain Databases
Pub Date: 2020-04-01 DOI: 10.1109/ICDE48307.2020.00206
Sara Cohen, Adam Rosenthal, Aviv Zohar
A key difference between storing data on a blockchain and in a centrally controlled database is that transactions are accepted into a blockchain via a consensus mechanism, not by a controlling central party. Hence, once a user has issued a transaction, she cannot be certain that it will be accepted. Moreover, a not-yet-accepted transaction cannot be retracted by the user, and may (or may not) be appended to the blockchain at any point in the future. This causes difficulties, as the user may wish to formulate new transactions based on knowledge of which previous transactions will be accepted, yet this knowledge is inherently uncertain. We introduce a formal abstraction of blockchains as a data storage layer underlying a database. The main issue we tackle is the need to reason about possible worlds, due to the uncertainty in transaction appending. In particular, we consider the theoretical complexity of determining whether it is possible for a denial constraint to be contradicted, given the current state of the blockchain, the pending transactions, and integrity constraints on blockchain data. We then present practical algorithms for this problem that work well in practice.
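The possible-worlds question can be phrased concretely: does some subset of the pending transactions, appended to the current state, contradict a denial constraint? A brute-force sketch over an account-balance state (the constraint "no balance goes negative" and all names are illustrative; the paper develops more practical algorithms than this exponential enumeration):

```python
from itertools import combinations

def constraint_violated(balances):
    """Denial constraint for this sketch: no account balance may be negative."""
    return any(v < 0 for v in balances.values())

def some_world_violates(balances, pending):
    """Enumerate every subset of pending transfers, each a
    (sender, receiver, amount) triple: is the denial constraint
    contradicted in at least one possible future world?"""
    txs = list(pending)
    for r in range(len(txs) + 1):
        for subset in combinations(txs, r):
            world = dict(balances)  # the world where exactly `subset` is accepted
            for sender, receiver, amount in subset:
                world[sender] -= amount
                world[receiver] = world.get(receiver, 0) + amount
            if constraint_violated(world):
                return True
    return False

balances = {'alice': 5, 'bob': 0}
pending = [('alice', 'bob', 3), ('alice', 'bob', 4)]
```

Each pending transfer is individually safe here, but the world in which both are accepted drives alice's balance to −2, so the constraint can be contradicted; dropping either transfer makes every possible world safe.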
{"title":"Reasoning about the Future in Blockchain Databases","authors":"Sara Cohen, Adam Rosenthal, Aviv Zohar","doi":"10.1109/ICDE48307.2020.00206","DOIUrl":"https://doi.org/10.1109/ICDE48307.2020.00206","url":null,"abstract":"A key difference between using blockchains to store data and centrally controlled databases is that transactions are accepted to a blockchain via a consensus mechanism, and not by a controlling central party. Hence, once a user has issued a transaction, she cannot be certain if it will be accepted. Moreover, a yet unaccepted transaction cannot be retracted by the user, and may (or may not) be appended to the blockchain at any point in the future. This causes difficulties as the user may wish to formulate new transactions based on the knowledge of which previous transactions will be accepted. Yet this knowledge is inherently uncertain. We introduce a formal abstraction for blockchains as a data storage layer that underlies a database. The main issue that we tackle is the need to reason about possible worlds, due to the uncertainty in transaction appending. In particular, we consider the theoretical complexity of determining whether it is possible for a denial constraint to be contradicted, given the current state of the blockchain, pending transactions, and integrity constraints on blockchain data. We then present practical algorithms for this problem that work well in practice.","PeriodicalId":6709,"journal":{"name":"2020 IEEE 36th International Conference on Data Engineering (ICDE)","volume":"42 1","pages":"1930-1933"},"PeriodicalIF":0.0,"publicationDate":"2020-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82272718","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
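The possible-worlds question above can be made concrete with a brute-force sketch (the paper contributes far more practical algorithms; the account-balance data model and all names below are assumptions for illustration): enumerate every subset of pending transactions and test whether some resulting world contradicts a denial constraint.

```python
# Illustrative brute force: a denial constraint forbids negative account
# balances, and we ask whether SOME possible future (i.e., some subset of
# pending transactions being appended) would violate it.

from itertools import chain, combinations

def constraint_violated(balances):
    return any(b < 0 for b in balances.values())

def future_violation_possible(state, pending):
    """Enumerate possible worlds: every subset of pending transactions."""
    subsets = chain.from_iterable(
        combinations(pending, r) for r in range(len(pending) + 1))
    for world in subsets:
        balances = dict(state)
        for acct, delta in world:
            balances[acct] = balances.get(acct, 0) + delta
        if constraint_violated(balances):
            return True
    return False

state = {"alice": 5, "bob": 3}
pending = [("alice", -4), ("alice", -4), ("bob", 2)]
print(future_violation_possible(state, pending))  # prints: True
```

The enumeration is exponential in the number of pending transactions, which is exactly why the theoretical complexity of the problem matters and why the paper develops more practical algorithms.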
Task Deployment Recommendation with Worker Availability
Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00175
Dong Wei, Senjuti Basu Roy, S. Amer-Yahia
We study the recommendation of deployment strategies that are consistent with a task requester's deployment parameters: a lower bound on the quality of the crowd contribution, an upper bound on the latency of task completion, and an upper bound on the cost incurred by paying workers. We propose BatchStrat, an optimization-driven middle layer that recommends deployment strategies to a batch of requests by accounting for worker availability. We develop computationally efficient algorithms to recommend deployments that maximize task throughput and pay-off, and empirically validate the quality and scalability of our approach.
{"title":"Task Deployment Recommendation with Worker Availability","authors":"Dong Wei, Senjuti Basu Roy, S. Amer-Yahia","doi":"10.1109/ICDE48307.2020.00175","DOIUrl":"https://doi.org/10.1109/ICDE48307.2020.00175","url":null,"abstract":"We study recommendation of deployment strategies to task requesters that are consistent with their deployment parameters: a lower-bound on the quality of the crowd contribution, an upper-bound on the latency of task completion, and an upper-bound on the cost incurred by paying workers. We propose BatchStrat, an optimization-driven middle layer that recommends deployment strategies to a batch of requests by accounting for worker availability. We develop computationally efficient algorithms to recommend deployments that maximize task throughput and pay-off, and empirically validate its quality and scalability.","PeriodicalId":6709,"journal":{"name":"2020 IEEE 36th International Conference on Data Engineering (ICDE)","volume":"1 1","pages":"1806-1809"},"PeriodicalIF":0.0,"publicationDate":"2020-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76581012","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
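A minimal sketch of the constrained selection problem BatchStrat addresses (the paper's optimization is more sophisticated; the strategy fields and values below are assumptions for illustration): filter candidate deployment strategies by the requester's quality floor, latency ceiling, and cost ceiling, then return the feasible strategy with the best predicted pay-off.

```python
# Toy constrained selection: each candidate strategy carries predicted
# quality, latency, cost, and pay-off; the requester's bounds prune the
# candidates, and the best remaining one is recommended.

def recommend(strategies, min_quality, max_latency, max_cost):
    feasible = [s for s in strategies
                if s["quality"] >= min_quality
                and s["latency"] <= max_latency
                and s["cost"] <= max_cost]
    if not feasible:
        return None  # no strategy satisfies the requester's bounds
    return max(feasible, key=lambda s: s["payoff"])

strategies = [
    {"name": "sequential", "quality": 0.90, "latency": 10, "cost": 5, "payoff": 8},
    {"name": "parallel",   "quality": 0.80, "latency": 4,  "cost": 9, "payoff": 6},
    {"name": "hybrid",     "quality": 0.85, "latency": 6,  "cost": 7, "payoff": 7},
]
best = recommend(strategies, min_quality=0.8, max_latency=8, max_cost=8)
print(best["name"])  # prints: hybrid
```

Handling a batch of requests under shared worker availability turns this per-request filter into a joint optimization, which is the throughput-maximization problem the abstract describes.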
Turbocharging Geospatial Visualization Dashboards via a Materialized Sampling Cube Approach
Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00105
Jia Yu, Mohamed Sarwat
In this paper, we present a middleware framework that runs on top of a SQL data system with the purpose of increasing the interactivity of geospatial visualization dashboards. The proposed system, Tabula, adopts a sampling cube approach that stores pre-materialized spatial samples and allows users to define their own accuracy loss function, such that the produced samples can be used for various user-defined visualization tasks. The system ensures that the difference between the sample fed into the visualization dashboard and the raw query answer never exceeds the user-specified loss threshold. To reduce the number of cells in the sampling cube, and hence the initialization time and memory utilization, the system employs two main strategies: (1) a partially materialized cube that only materializes local samples for those queries whose global sample (the sample drawn from the entire dataset) exceeds the required accuracy loss threshold, and (2) a sample selection technique that finds similarities between different local samples and only persists a few representative samples. Based on an extensive experimental evaluation, Tabula can bring the total data-to-visualization time (including both data-system and visualization times) of a heat map generated over 700 million taxi rides down to 600 milliseconds with a user-defined accuracy loss of 250 meters. Besides, Tabula requires up to two orders of magnitude less memory (e.g., only 800 MB for the running example) and one order of magnitude less initialization time than the fully materialized sampling cube.
{"title":"Turbocharging Geospatial Visualization Dashboards via a Materialized Sampling Cube Approach","authors":"Jia Yu, Mohamed Sarwat","doi":"10.1109/ICDE48307.2020.00105","DOIUrl":"https://doi.org/10.1109/ICDE48307.2020.00105","url":null,"abstract":"In this paper, we present a middleware framework that runs on top of a SQL data system with the purpose of increasing the interactivity of geospatial visualization dashboards. The proposed system adopts a sampling cube approach that stores pre-materialized spatial samples and allows users to define their own accuracy loss function such that the produced samples can be used for various user-defined visualization tasks. The system ensures that the difference between the sample fed into the visualization dashboard and the raw query answer never exceeds the user-specified loss threshold. To reduce the number of cells in the sampling cube and hence mitigate the initialization time and memory utilization, the system employs two main strategies: (1) a partially materialized cube to only materialize local samples of those queries for which the global sample (the sample drawn from the entire dataset) exceeds the required accuracy loss threshold. (2) a sample selection technique that finds similarities between different local samples and only persists a few representative samples. Based on the extensive experimental evaluation, Tabula can bring down the total data-to-visualization time (including both data-system and visualization times) of a heat map generated over 700 million taxi rides to 600 milliseconds with 250 meters user-defined accuracy loss. Besides, Tabula costs up to two orders of magnitude less memory footprint (e.g., only 800 MB for the running example) and one order of magnitude less initialization time than the fully materialized sampling cube.","PeriodicalId":6709,"journal":{"name":"2020 IEEE 36th International Conference on Data Engineering (ICDE)","volume":"5 1","pages":"1165-1176"},"PeriodicalIF":0.0,"publicationDate":"2020-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72664573","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
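The partial-materialization idea above can be sketched in a simplified form (an assumption-laden illustration, not Tabula's implementation; all names and data below are invented): a user-defined loss function compares an aggregate computed on a sample against the raw answer, and a local sample is materialized for a cube cell only when the global sampling rate is too lossy for that cell.

```python
# Simplified sampling-cube decision: for each cell, draw a sample at the
# global rate and materialize a dedicated local sample only if the
# user-defined accuracy loss exceeds the threshold.

import random

def mean(xs):
    return sum(xs) / len(xs)

def relative_loss(sample_values, true_values):
    """User-defined accuracy loss: relative error of the mean."""
    return abs(mean(sample_values) - mean(true_values)) / abs(mean(true_values))

def cells_needing_local_samples(cells, global_sample_rate, threshold, seed=7):
    random.seed(seed)  # deterministic for the sketch
    needing = []
    for name, values in cells.items():
        k = max(1, int(len(values) * global_sample_rate))
        sample = random.sample(values, k)
        if relative_loss(sample, values) > threshold:
            needing.append(name)  # global sample too lossy for this cell
    return needing

cells = {"manhattan": list(range(1000)), "queens": [5] * 50 + [500] * 2}
print(cells_needing_local_samples(cells, 0.05, 0.1))
```

Skewed cells (like the toy "queens" cell) are the ones a uniform global sample tends to misrepresent, which is why only those cells need their own pre-materialized samples.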