
Latest publications from the 2020 IEEE 36th International Conference on Data Engineering (ICDE)

Summarizing Hierarchical Multidimensional Data
Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00081
Alexandra Kim, L. Lakshmanan, D. Srivastava
Data scientists typically analyze and extract insights from large multidimensional data sets such as US census data, enterprise sales data, and so on. But before sophisticated machine learning and statistical methods are employed, it is useful to build and explore concise summaries of the data set. While a variety of summaries have been proposed over the years, the goal of creating a concise summary of multidimensional data that can provide worst-case accuracy guarantees has remained elusive. In this paper, we propose Tree Summaries, which attain this challenging goal over arbitrary hierarchical multidimensional data sets. Intuitively, a Tree Summary is a weighted "embedded tree" in the lattice that is the cross-product of the dimension hierarchies; individual data values can be efficiently estimated by looking up the weight of their unique closest ancestor in the Tree Summary. We study the problems of generating lossless as well as (given a desired worst-case accuracy guarantee) lossy Tree Summaries. We develop a polynomial-time algorithm that constructs the optimal (i.e., most concise) Tree Summary for each of these problems; this is a surprising result given the NP-hardness of constructing a variety of other optimal summaries over multidimensional data. We complement our analytical results with an empirical evaluation of our algorithm, and demonstrate with a detailed set of experiments on real and synthetic data sets that our algorithm outperforms prior methods in terms of conciseness of summaries or accuracy of estimation.
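The closest-ancestor lookup is the part of the estimation procedure that is easy to picture in code. Below is a minimal Python sketch of that lookup only, under assumed data structures: each dimension hierarchy is a dict mapping a value to its parent, and the Tree Summary is a dict from lattice nodes to weights. The names and the breadth-first search order are illustrative and do not reproduce the paper's construction algorithm.

```python
from collections import deque

def ancestors(cell, hierarchies):
    """Yield lattice ancestors of `cell` in non-decreasing generalization distance.

    `cell` is a tuple with one value per dimension; `hierarchies[i]` maps a value
    of dimension i to its parent (absent at the root).
    """
    queue, seen = deque([cell]), {cell}
    while queue:
        node = queue.popleft()
        yield node
        for i, value in enumerate(node):
            parent = hierarchies[i].get(value)
            if parent is not None:
                up = node[:i] + (parent,) + node[i + 1:]
                if up not in seen:
                    seen.add(up)
                    queue.append(up)

def estimate(cell, tree_summary, hierarchies, default=0.0):
    """Estimate a data value by the weight of the closest ancestor in the summary."""
    for node in ancestors(cell, hierarchies):
        if node in tree_summary:
            return tree_summary[node]
    return default
```

For example, with hierarchies `[{"Seattle": "WA", "WA": "US"}, {"Jan": "Q1"}]` and a summary `{("WA", "Q1"): 40.0}`, `estimate(("Seattle", "Jan"), ...)` returns 40.0, the weight of the closest summarized ancestor.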
Pages: 877-888
Citations: 9
Scaling Out Schema-free Stream Joins
Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00075
Damjan Gjurovski, S. Michel
In this work, we consider computing natural joins over massive streams of JSON documents that do not adhere to a specific schema. We first propose an efficient and scalable partitioning algorithm that uses the main principles of association analysis to identify patterns of co-occurrence of the attribute-value pairs within the documents. Data is then forwarded accordingly to compute nodes and locally joined using a novel FP-tree–based join algorithm. By compactly storing the documents and efficiently traversing the FP-tree structure, the proposed join algorithm can operate on large input sizes and provide results in real time. We discuss data-dependent scalability limitations that are inherent to natural joins over schema-free data and show how to practically circumvent them by artificially expanding the space of possible attribute-value pairs. The proposed algorithms are realized in the Apache Storm stream processing framework. Through extensive experiments with real-world as well as synthetic data, we evaluate the proposed algorithms and show that they outperform competing approaches.
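As a rough illustration of the partitioning step, the sketch below counts attribute-value pair co-occurrence on a sample of documents and routes each incoming document by its most frequent pair. The flattening, the routing rule, and the worker assignment are assumptions made for illustration; they do not reproduce the paper's partitioning algorithm or its FP-tree-based join.

```python
from collections import Counter

def attribute_value_pairs(doc, prefix=""):
    """Flatten a (possibly nested) JSON-like dict into hashable attribute-value pairs."""
    pairs = []
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            pairs.extend(attribute_value_pairs(value, path))
        else:
            pairs.append((path, repr(value)))
    return pairs

def build_frequencies(sample_docs):
    """Count how often each attribute-value pair occurs in a sample of the stream."""
    counts = Counter()
    for doc in sample_docs:
        counts.update(attribute_value_pairs(doc))
    return counts

def route(doc, frequencies, num_workers):
    """Assign a document to a worker based on its most frequent attribute-value pair."""
    pairs = attribute_value_pairs(doc)
    anchor = max(pairs, key=lambda p: frequencies.get(p, 0))
    return hash(anchor) % num_workers
```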
Pages: 805-816
Citations: 0
Efficient Locality-Sensitive Hashing Over High-Dimensional Data Streams
Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00220
Chengcheng Yang, Dong Deng, Shuo Shang, Ling Shao
Approximate Nearest Neighbor (ANN) search in high-dimensional space is a fundamental task in many applications. Locality-Sensitive Hashing (LSH) is a well-known methodology for solving the ANN problem with theoretical guarantees and good empirical performance. We observe that existing LSH-based approaches target the problem of designing search-optimized indexes, which require a number of separate indexes and incur high index maintenance overhead, and are hence impractical for high-dimensional streaming data processing. In this paper, we present PDA-LSH, a novel and practical disk-based LSH index that can offer efficient support for both updates and searches. Experiments on real-world datasets show that our proposal outperforms the state-of-the-art schemes by up to 10× on update performance and up to 2× on search performance.
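For readers unfamiliar with LSH itself, a generic random-hyperplane index for cosine similarity looks roughly like the sketch below. It only illustrates the hashing idea; it says nothing about PDA-LSH's disk layout or its update path, and the class and parameter names are made up.

```python
import numpy as np

class HyperplaneLSH:
    """A toy in-memory LSH index using random hyperplanes (cosine similarity)."""

    def __init__(self, dim, num_bits=16, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.normal(size=(num_bits, dim))  # one hyperplane per hash bit
        self.buckets = {}

    def _key(self, vector):
        bits = (self.planes @ vector) > 0               # which side of each hyperplane
        return bits.tobytes()

    def insert(self, vector, item_id):
        self.buckets.setdefault(self._key(vector), []).append(item_id)

    def candidates(self, vector):
        """Return ids colliding with the query; a real index probes many tables."""
        return self.buckets.get(self._key(vector), [])
```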
Pages: 1986-1989
Citations: 5
User-driven Error Detection for Time Series with Events
Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00070
Kim-Hung Le, Paolo Papotti
Anomalies are pervasive in time series data, such as sensor readings. Existing methods for anomaly detection cannot distinguish between anomalies that represent data errors, such as incorrect sensor readings, and notable events, such as the watering action in soil monitoring. In addition, the detection quality of such methods depends heavily on configuration parameters, which are dataset specific. In this work, we exploit active learning to detect both errors and events in a single solution that aims at minimizing user interaction. For this joint detection, we introduce an algorithm that accurately detects and labels anomalies using a non-parametric notion of neighborhood and probabilistic classification. Given a desired quality, the confidence of the classification is then used as the termination condition for the active learning algorithm. Experiments on real and synthetic datasets demonstrate that our approach achieves an F-score above 80% in detecting errors by labeling only 2 to 5 points in one data series. We also show the superiority of our solution compared to the state-of-the-art approaches for anomaly detection. Finally, we demonstrate the positive impact of our error detection methods on downstream data repairing algorithms.
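A bare-bones version of the overall loop might look like the following: score each point by a simple neighborhood distance, ask the user to label the few highest-scoring points as errors or events, and propagate those labels to the remaining candidates by value similarity. The scoring rule, the fixed labeling budget, and the propagation step are placeholders, not the paper's non-parametric neighborhood model or its probabilistic classifier.

```python
import numpy as np

def neighborhood_scores(series, k=5):
    """Score each point by its mean distance to its k closest values in the series."""
    series = np.asarray(series, dtype=float)
    scores = np.empty(len(series))
    for i, x in enumerate(series):
        dists = np.sort(np.abs(series - x))
        scores[i] = dists[1:k + 1].mean()   # skip the zero self-distance
    return scores

def detect(series, ask_user, budget=3, num_candidates=10, k=5):
    """Ask the user for `budget` labels ('error' or 'event') and propagate them."""
    series = np.asarray(series, dtype=float)
    ranked = np.argsort(neighborhood_scores(series, k))[::-1]   # most anomalous first
    labeled = {int(i): ask_user(int(i)) for i in ranked[:budget]}
    labels = {}
    for i in ranked[:num_candidates]:
        i = int(i)
        nearest = min(labeled, key=lambda j: abs(series[j] - series[i]))
        labels[i] = labeled.get(i, labeled[nearest])
    return labels
```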
Pages: 745-757
Citations: 6
G-thinker: A Distributed Framework for Mining Subgraphs in a Big Graph
Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00122
Da Yan, Guimu Guo, Md Mashiur Rahman Chowdhury, M. Tamer Özsu, Wei-Shinn Ku, John C.S. Lui
Mining from a big graph those subgraphs that satisfy certain conditions is useful in many applications such as community detection and subgraph matching. These problems have a high time complexity, but existing systems for scaling them are all IO-bound in execution. We propose the first truly CPU-bound distributed framework, called G-thinker, which adopts a user-friendly subgraph-centric vertex-pulling API for writing distributed subgraph mining algorithms. To utilize all CPU cores of a cluster, G-thinker features (1) a highly concurrent vertex cache for parallel task access and (2) a lightweight task scheduling approach that ensures high task throughput. These designs overlap communication with computation to minimize CPU idle time. Extensive experiments demonstrate that G-thinker achieves orders of magnitude speedup even compared with the fastest existing subgraph-centric system, and that it scales well to much larger and denser real network data. G-thinker is open-sourced at http://bit.ly/gthinker with detailed documentation.
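To make the subgraph-centric, vertex-pulling style concrete, here is a toy, single-machine analogue for triangle enumeration: a task grows a partial subgraph and pulls adjacency sets from a shared cache as needed. The cache is just a dict here, and nothing below reflects G-thinker's actual API, scheduling, or distribution.

```python
def pull(cache, v):
    """Fetch v's adjacency set; in a distributed setting this may hit remote storage."""
    return cache[v]

def enumerate_triangles(cache):
    """Grow one-vertex tasks into triangles by repeatedly pulling neighbor sets."""
    triangles = []
    for u in cache:                         # every vertex seeds a task
        neighbors_u = pull(cache, u)
        for v in neighbors_u:
            if v <= u:
                continue                    # enumerate each triangle once
            common = neighbors_u & pull(cache, v)
            triangles.extend((u, v, w) for w in common if w > v)
    return triangles

# Example adjacency cache: {1: {2, 3}, 2: {1, 3}, 3: {1, 2}} yields [(1, 2, 3)].
```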
Pages: 1369-1380
Citations: 29
Preserving Contextual Information in Relational Matrix Operations
Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00197
O. Dolmatova, Nikolaus Augsten, Michael H. Böhlen
There exist large amounts of numerical data that are stored in databases and must be analyzed. Database tables come with a schema and include non-numerical attributes; this is crucial contextual information that is needed for interpreting the numerical values. We propose relational matrix operations that support the analysis of data stored in tables and that preserve contextual information. The results of our approach are precisely defined relational matrix operations and a system implementation in MonetDB that illustrates the seamless integration of relational matrix operations into a relational DBMS.
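The idea that matrix operations over tables should keep the non-numerical context attached can be sketched quickly outside the DBMS. The pandas example below is only illustrative, with made-up table and column names, and is unrelated to the MonetDB implementation described in the paper.

```python
import numpy as np
import pandas as pd

# A table mixing a contextual attribute (region) with a numeric block (q1, q2).
sales = pd.DataFrame({
    "region": ["north", "south", "west"],
    "q1": [10.0, 7.0, 3.0],
    "q2": [12.0, 6.0, 4.0],
})
weights = np.array([0.4, 0.6])

# A context-preserving matrix-vector product: only q1/q2 enter the multiplication,
# but the region column travels with the result so each row stays interpretable.
result = sales[["region"]].assign(score=sales[["q1", "q2"]].to_numpy() @ weights)
print(result)
#   region  score
# 0  north   11.2
# 1  south    6.4
# 2   west    3.6
```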
Pages: 1894-1897
Citations: 4
Doubleheader Logging: Eliminating Journal Write Overhead for Mobile DBMS
Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00111
Sehyeon Oh, Wook-Hee Kim, Jihye Seo, Hyeonho Song, S. Noh, Beomseok Nam
Various transactional systems use out-of-place updates such as logging or copy-on-write mechanisms to update data in a failure-atomic manner. Such out-of-place update methods double the I/O traffic due to back-up copies in the database layer and quadruple the I/O traffic due to file system journaling. In mobile systems, transaction sizes of mobile apps are known to be tiny and transactions run at low concurrency. For such mobile transactions, legacy out-of-place update methods such as WAL are sub-optimal. In this work, we propose a crash-consistent in-place update logging method for SQLite: doubleheader logging (DHL). DHL prevents previous consistent records from being lost by performing a copy-on-write inside the database page and co-locating the metadata-only journal information within the page. This is done, in turn, with minimal sacrifice to page utilization. DHL is similar to when journaling is disabled, in the sense that it incurs almost no additional overhead in terms of both I/O and computation. Our experimental results show that DHL outperforms other logging methods such as out-of-place update write-ahead logging (WAL) and in-place update multi-version B-tree (MVBT).
Pages: 1237-1248
Citations: 1
SLED: Semi-supervised Locally-weighted Ensemble Detector
Pub Date : 2020-04-01 DOI: 10.1109/icde48307.2020.00183
Shuxiang Zhang, David Tse Jung Huang, G. Dobbie, Yun Sing Koh
Concept drift detection refers to the process of detecting changes in the underlying distribution of data. Interest in it within the data stream mining community has increased because of its role in improving the performance of online learning algorithms. Over the years, a myriad of drift detection methods have been proposed. However, most of these methods are single detectors, which usually work well only with a single type of drift. In this research, we propose a semi-supervised locally-weighted ensemble detector (SLED), where the relative performance among its base detectors is characterized by a set of weights learned in a semi-supervised manner. The aim of this technique is to effectively deal with both abrupt and gradual concept drifts. In our experiments, SLED is configured with ten well-known drift detectors. To evaluate the performance of SLED, we compare it with single detectors as well as state-of-the-art ensemble methods on both synthetic and real-world datasets using different performance measures. The experimental results show that SLED has fewer false positives, higher precision, and a higher Matthews correlation coefficient, while maintaining reasonably good performance on other measures.
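A weighted vote over several base detectors is straightforward to sketch; what the sketch leaves out is exactly SLED's contribution, the semi-supervised, locally learned weights. The fixed weights and the assumed boolean `add_element` detector interface below are placeholders.

```python
class WeightedDriftEnsemble:
    """Combine base drift detectors with fixed weights (a stand-in for learned ones)."""

    def __init__(self, detectors, weights, threshold=0.5):
        assert len(detectors) == len(weights) and sum(weights) > 0
        self.detectors = detectors
        self.weights = weights
        self.threshold = threshold

    def add_element(self, value):
        """Feed one stream value to every detector; return True if drift is signaled."""
        votes = [d.add_element(value) for d in self.detectors]  # assumed boolean votes
        score = sum(w for w, drift in zip(self.weights, votes) if drift)
        return score / sum(self.weights) >= self.threshold
```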
Pages: 1838-1841
Citations: 2
ML-based Cross-Platform Query Optimization
Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00132
Zoi Kaoudi, Jorge-Arnulfo Quiané-Ruiz, Bertty Contreras-Rojas, Rodrigo Pardo-Meza, Anis Troudi, S. Chawla
Cost-based optimization is widely known to suffer from a major weakness: administrators spend a significant amount of time tuning the associated cost models. This problem only gets exacerbated in cross-platform settings, as there are many more parameters that need to be tuned. In the era of machine learning (ML), the first step to remedy this problem is to replace the cost model of the optimizer with an ML model. However, such a solution brings two major challenges. First, the optimizer has to transform a query plan into a vector millions of times during plan enumeration, incurring a very high overhead. Second, a lot of training data is required to effectively train the ML model. We overcome these challenges in Robopt, a novel vector-based optimizer we have built for Rheem, a cross-platform system. Robopt not only uses an ML model to prune the search space but also bases the entire plan enumeration on a set of algebraic operations that operate on vectors, which are a natural fit for the ML model. This leads to both speed-up and scale-up of the enumeration process by exploiting modern CPUs via vectorization. We also accompany Robopt with a scalable training data generator for building its ML model. Our evaluation shows that (i) the vector-based approach is more efficient and scalable than simply using an ML model and (ii) Robopt matches and, in some cases, improves on Rheem’s cost-based optimizer in choosing good plans without requiring any tuning effort.
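To illustrate the "cost model replaced by an ML model over plan vectors" idea in the simplest possible terms, the sketch below featurizes a plan as operator counts and lets a regressor pick the cheapest candidate. The featurization, the operator set, and the scikit-learn model are placeholders; they are not Robopt's vectorization or its enumeration algebra.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

OPERATORS = ["scan", "filter", "join", "aggregate"]

def plan_to_vector(plan):
    """Encode a plan (here: a list of operator names) as operator counts plus length."""
    return np.array([plan.count(op) for op in OPERATORS] + [len(plan)], dtype=float)

def train_cost_model(executed_plans, runtimes):
    """Fit a simple regressor from plan vectors to observed runtimes."""
    features = np.stack([plan_to_vector(p) for p in executed_plans])
    return LinearRegression().fit(features, runtimes)

def choose_plan(candidate_plans, cost_model):
    """Score every candidate plan with the learned model and keep the cheapest."""
    features = np.stack([plan_to_vector(p) for p in candidate_plans])
    return candidate_plans[int(np.argmin(cost_model.predict(features)))]
```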
Pages: 1489-1500
Citations: 15
ForkBase: Immutable, Tamper-evident Storage Substrate for Branchable Applications
Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00153
Qian Lin, Kaiyuan Yang, Tien Tuan Anh Dinh, Qingchao Cai, Gang Chen, B. Ooi, Pingcheng Ruan, Sheng Wang, Zhongle Xie, Meihui Zhang, Olafs Vandans
Data collaboration activities typically require systematic or protocol-based coordination to be scalable. Git, an effective enabler of collaborative coding, has proven its success in countless projects around the world. Hence, applying the Git philosophy to general data collaboration beyond coding is appealing; we call it Git for data. However, the original Git design handles data at the file granule, which is considered too coarse-grained for many database applications. We argue that Git for data should be co-designed with database systems. To this end, we developed ForkBase to make Git for data practical. ForkBase is a distributed, immutable storage system designed for data version management and data collaborative operation. In this demonstration, we show how ForkBase can greatly facilitate collaborative data management and how its novel data deduplication technique can improve storage efficiency for archiving massive data versions.
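Chunk-level, content-addressed deduplication is the general technique behind archiving many data versions cheaply. The fixed-size chunking below is a simplification for illustration and is not ForkBase's actual storage layout.

```python
import hashlib

class DedupStore:
    """Store blobs as content-addressed chunks so identical chunks are kept once."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.chunks = {}                      # sha256 hex digest -> chunk bytes

    def put(self, data):
        """Store one version of a blob and return its list of chunk digests."""
        version = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)   # identical chunks stored once
            version.append(digest)
        return version

    def get(self, version):
        """Reassemble a version from its chunk digests."""
        return b"".join(self.chunks[d] for d in version)
```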
Pages: 1718-1721
Citations: 6