
Proceedings of the 2018 International Conference on Management of Data: Latest Publications

Dostoevsky: Better Space-Time Trade-Offs for LSM-Tree Based Key-Value Stores via Adaptive Removal of Superfluous Merging
Pub Date : 2018-05-27 DOI: 10.1145/3183713.3196927
Niv Dayan, Stratos Idreos
In this paper, we show that all mainstream LSM-tree based key-value stores in the literature and in industry are suboptimal with respect to how they trade off among the I/O costs of updates, point lookups, range lookups, as well as the cost of storage, measured as space-amplification. The reason is that they perform expensive merge operations in order to (1) bound the number of runs that a lookup has to probe, and to (2) remove obsolete entries to reclaim space. However, most of these merge operations reduce point lookup cost, long range lookup cost, and space-amplification by a negligible amount. To address this problem, we expand the LSM-tree design space with Lazy Leveling, a new design that prohibits merge operations at all levels of LSM-tree but the largest. We show that Lazy Leveling improves the worst-case cost complexity of updates while maintaining the same bounds on point lookup cost, long range lookup cost, and space-amplification. To be able to navigate between Lazy Leveling and other designs, we make the LSM-tree design space fluid by introducing Fluid LSM-tree, a generalization of LSM-tree that can be parameterized to assume all existing LSM-tree designs. We show how to fluidly transition from Lazy Leveling to (1) designs that are more optimized for updates by merging less at the largest level, and (2) designs that are more optimized for small range lookups by merging more at all other levels. We put everything together to design Dostoevsky, a key-value store that navigates the entire Fluid LSM-tree design space based on the application workload and hardware to maximize throughput using a novel closed-form performance model. We implemented Dostoevsky on top of RocksDB, and we show that it strictly dominates state-of-the-art LSM-tree based key-value stores in terms of performance and space-amplification.
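For intuition, here is a minimal, self-contained sketch of the Lazy Leveling merge policy described in the abstract: every level other than the largest simply stacks runs (tiering), while the largest level is maintained as a single sorted run (leveling). The class name, size ratio T, and buffer capacity are illustrative assumptions, not Dostoevsky's or RocksDB's actual API.

```python
# Assumed toy parameters; Dostoevsky tunes these from the workload and hardware.
T = 4               # maximum runs per level / size ratio between levels
BUFFER_CAP = 16     # in-memory write buffer capacity (entries)

class LazyLevelingLSM:
    """Toy LSM-tree with Lazy Leveling: tiering (no intra-level merges) at all
    levels except the largest, which is kept as a single sorted run."""

    def __init__(self, num_levels=3):
        self.buffer = {}
        # Each level is a list of runs; a run is a sorted list of (key, value).
        self.levels = [[] for _ in range(num_levels)]

    def put(self, key, value):
        self.buffer[key] = value
        if len(self.buffer) >= BUFFER_CAP:
            self._flush()

    def _flush(self):
        run = sorted(self.buffer.items())
        self.buffer = {}
        self._add_run(0, run)

    def _add_run(self, level, run):
        if level == len(self.levels) - 1:
            # Largest level: merge the incoming run into its single run (leveling).
            merged = dict(self.levels[level][0]) if self.levels[level] else {}
            merged.update(dict(run))
            self.levels[level] = [sorted(merged.items())]
            return
        # Other levels: just stack the run; merge only once the level is full (tiering).
        self.levels[level].append(run)
        if len(self.levels[level]) >= T:
            merged = {}
            for r in self.levels[level]:      # older runs first, newer entries overwrite
                merged.update(dict(r))
            self.levels[level] = []
            self._add_run(level + 1, sorted(merged.items()))

    def get(self, key):
        if key in self.buffer:
            return self.buffer[key]
        for runs in self.levels:              # smaller (newer) levels first
            for run in reversed(runs):        # newest run within a level last
                value = dict(run).get(key)
                if value is not None:
                    return value
        return None

db = LazyLevelingLSM()
for i in range(200):
    db.put(i % 60, i)
print(db.get(7))   # -> 187, the latest value written for key 7
```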
Citations: 123
FREDDY: Fast Word Embeddings in Database Systems
Pub Date : 2018-05-27 DOI: 10.1145/3183713.3183717
Michael Günther
{"title":"FREDDY: Fast Word Embeddings in Database Systems","authors":"Michael Günther","doi":"10.1145/3183713.3183717","DOIUrl":"https://doi.org/10.1145/3183713.3183717","url":null,"abstract":"","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89604346","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 21
Catching Numeric Inconsistencies in Graphs
Pub Date : 2018-05-27 DOI: 10.1145/3183713.3183753
W. Fan, Xueli Liu, Ping Lu, Chao Tian
Numeric inconsistencies are common in real-life knowledge bases and social networks. To catch such errors, we propose to extend graph functional dependencies with linear arithmetic expressions and comparison predicates, referred to as NGDs. We study fundamental problems for NGDs. We show that their satisfiability, implication and validation problems are Σ₂ᵖ-complete, Π₂ᵖ-complete and coNP-complete, respectively. However, if we allow non-linear arithmetic expressions, even of degree at most 2, the satisfiability and implication problems become undecidable. In other words, NGDs strike a balance between expressivity and complexity. To make practical use of NGDs, we develop an incremental algorithm IncDect to detect errors in a graph G using NGDs, in response to updates ΔG to G. We show that the incremental validation problem is coNP-complete. Nonetheless, algorithm IncDect is localizable, i.e., its cost is determined by small neighbors of nodes in ΔG instead of the entire G. Moreover, we parallelize IncDect such that it guarantees to reduce running time with the increase of processors. Using real-life and synthetic graphs, we experimentally verify the scalability and efficiency of the algorithms.
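To make the notion of an NGD concrete, the sketch below validates one dependency whose consequence is a linear arithmetic constraint with comparison predicates over a toy property graph. The graph encoding and the example constraint (a bounded lifespan) are assumptions for illustration; this is not the paper's IncDect algorithm.

```python
# A toy property graph: nodes carry attributes, labelled edges map (source, label) -> target.
# The NGD checked here: if person p has bornIn -> y1 and diedIn -> y2, then
# 0 <= y2.value - y1.value <= 120 (a linear arithmetic constraint with comparisons).

def ngd_violations(nodes, edges, max_lifespan=120):
    """Return ids of person nodes whose matched years violate the constraint."""
    bad = []
    for (src, label), born in edges.items():
        if label != "bornIn":
            continue
        died = edges.get((src, "diedIn"))
        if died is None:
            continue                       # pattern does not match, nothing to check
        born_year = nodes[born]["value"]
        died_year = nodes[died]["value"]
        if not (0 <= died_year - born_year <= max_lifespan):
            bad.append(src)
    return bad

# A knowledge-base entry claiming a 300-year lifespan is caught as inconsistent.
nodes = {"p1": {"type": "person"}, "y1": {"value": 1700}, "y2": {"value": 2000}}
edges = {("p1", "bornIn"): "y1", ("p1", "diedIn"): "y2"}
print(ngd_violations(nodes, edges))        # -> ['p1']
```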
Citations: 24
BIPie: Fast Selection and Aggregation on Encoded Data using Operator Specialization
Pub Date : 2018-05-27 DOI: 10.1145/3183713.3190658
Michal Nowakiewicz, E. Boutin, E. Hanson, R. Walzer, Akash Katipally
Advances in modern hardware, such as increases in the size of main memory available on computers, have made it possible to analyze data at a much higher rate than before. In this paper, we demonstrate that there is tremendous room for improvement in the processing of analytical queries on modern commodity hardware. We introduce BIPie, an engine for query processing implementing highly efficient decoding, selection, and aggregation for analytical queries executing on a columnar storage engine in MemSQL. We demonstrate that these operations are interdependent, and must be fused and considered together to achieve very high performance. We propose and compare multiple strategies for decoding, selection and aggregation (with GROUP BY), all of which are designed to take advantage of modern CPU architectures, including SIMD. We implemented these approaches in MemSQL, a high performance hybrid transaction and analytical processing database designed for commodity hardware. We thoroughly evaluate the performance of the approach across a range of parameters, and demonstrate a two to four times speedup over previously published TPC-H Query 1 performance.
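The following sketch illustrates the fused decode/select/aggregate idea on dictionary-encoded columns, with NumPy's vectorized kernels standing in for the hand-written SIMD specializations; the schema and column names are invented for illustration.

```python
import numpy as np

# Dictionary-encoded columns: small integer codes plus a decode dictionary.
status_dict  = np.array(["open", "shipped", "returned"])
status_codes = np.array([0, 1, 1, 2, 1, 0, 1], dtype=np.uint8)
region_codes = np.array([0, 1, 0, 1, 1, 0, 1], dtype=np.uint8)   # GROUP BY key
price        = np.array([10.0, 25.0, 40.0, 5.0, 30.0, 12.0, 8.0])

# Selection directly on the codes, no per-row string decode:
#   WHERE status = 'shipped'
shipped_code = np.flatnonzero(status_dict == "shipped")[0]
mask = status_codes == shipped_code

# Aggregation fused over the selected rows:
#   SELECT region, SUM(price) ... GROUP BY region
sums = np.bincount(region_codes[mask], weights=price[mask],
                   minlength=int(region_codes.max()) + 1)
print(sums)   # per-region totals over the rows that passed the filter
```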
Citations: 4
Trip Planning by an Integrated Search Paradigm
Pub Date : 2018-05-27 DOI: 10.1145/3183713.3193543
Sheng Wang, Mingzhao Li, Yipeng Zhang, Z. Bao, David Alexander Tedjopurnomo, X. Qin
In this paper, we build a trip planning system called TISP, which enables users to interactively explore POIs and trajectories during incremental trip planning. At the back end, TISP is able to support seven types of common queries over spatial-only, spatial-textual and textual-only data, based on our proposed unified indexing and search paradigm [7]. At the front end, we propose novel visualisation designs to present the results of different types of queries; our user-friendly interaction designs allow users to construct further queries without inputting any text.
Citations: 10
Session details: Industry 4: Graph databases & Query Processing on Modern Hardware
Jianjun Chen
{"title":"Session details: Industry 4: Graph databases & Query Processing on Modern Hardware","authors":"Jianjun Chen","doi":"10.1145/3258021","DOIUrl":"https://doi.org/10.1145/3258021","url":null,"abstract":"","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83628474","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Column Sketches: A Scan Accelerator for Rapid and Robust Predicate Evaluation
Pub Date : 2018-05-27 DOI: 10.1145/3183713.3196911
Brian Hentschel, Michael S. Kester, Stratos Idreos
While numerous indexing and storage schemes have been developed to address the core functionality of predicate evaluation in data systems, they all require specific workload properties (query selectivity, data distribution, data clustering) to provide good performance and fail in other cases. We present a new class of indexing scheme, termed a Column Sketch, which improves the performance of predicate evaluation independently of workload properties. Column Sketches work primarily through the use of lossy compression schemes which are designed so that the index ingests data quickly, evaluates any query performantly, and has a small memory footprint. A Column Sketch works by applying this lossy compression on a value-by-value basis, mapping base data to a representation of smaller fixed-width codes. Queries are evaluated affirmatively or negatively for the vast majority of values using the compressed data, and only if needed check the base data for the remaining values. Column Sketches work over column, row, and hybrid storage layouts. We demonstrate that by using a Column Sketch, the select operator in modern analytic systems attains better CPU efficiency and less data movement than state-of-the-art storage and indexing schemes. Compared to standard scans, Column Sketches provide an improvement of 3×-6× for numerical attributes and 2.7× for categorical attributes. Compared to state-of-the-art scan accelerators such as Column Imprints and BitWeaving, Column Sketches perform 1.4×-4.8× better.
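The sketch below illustrates the core mechanism on a numeric column: a lossy, value-by-value map to one-byte codes decides a predicate such as value < c for most rows, and only rows whose code straddles the constant touch the base data. The equi-depth bucketing used here is an assumed compression map for illustration, not necessarily the paper's.

```python
import numpy as np

def build_sketch(base, num_codes=256):
    # Equi-depth boundaries give each 1-byte code roughly the same number of values.
    boundaries = np.quantile(base, np.linspace(0, 1, num_codes + 1)[1:-1])
    codes = np.searchsorted(boundaries, base).astype(np.uint8)
    return boundaries, codes

def less_than(base, boundaries, codes, c):
    c_code = np.searchsorted(boundaries, c)
    result = codes < c_code                   # these rows definitely satisfy value < c
    ambiguous = np.flatnonzero(codes == c_code)
    result[ambiguous] = base[ambiguous] < c   # only these rows touch the base data
    return result

base = np.random.default_rng(0).normal(size=1_000_000)
boundaries, codes = build_sketch(base)
sel = less_than(base, boundaries, codes, 0.5)
assert np.array_equal(sel, base < 0.5)        # same answer as a full scan of the base data
```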
Citations: 26
Eon Mode: Bringing the Vertica Columnar Database to the Cloud
Pub Date : 2018-05-27 DOI: 10.1145/3183713.3196938
Ben Vandiver, S. Prasad, Pratibha Rana, Eden Zik, Amin Saeidi, Pratyush Parimal, Styliani Pantela, J. Dave
The Vertica Analytic Database is a powerful tool for high performance, large scale SQL analytics. Historically, Vertica has managed direct-attached disk for performance and reliability, at a cost of product complexity and scalability. Eon mode is a new architecture for Vertica that places the data on a reliable shared storage, matching the original architecture's performance on existing workloads and supporting new workloads. While the design reuses Vertica's optimizer and execution engine, the metadata, storage, and fault tolerance mechanisms are re-architected to enable and take advantage of shared storage. A sharding mechanism distributes load over the nodes while retaining the capability of running node-local table joins. Running on Amazon EC2 compute and S3 storage, Eon mode demonstrates good performance, superior scalability, and robust operational behavior. With these improvements, Vertica delivers on the promise of cloud economics, consuming only the compute and storage resources needed, while supporting efficient elasticity.
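As a rough illustration of the shard/subscription idea, the sketch below hashes rows into a fixed number of shards on a segmentation key and lets each node subscribe to a subset of shards, so tables segmented on the join key can be joined node-locally. All names and the hash function are assumptions for illustration, not Vertica's implementation.

```python
NUM_SHARDS = 6
NODES = ["node1", "node2", "node3"]

def shard_of(key):
    # Deterministic toy hash on the segmentation key.
    return sum(ord(ch) for ch in str(key)) % NUM_SHARDS

# Every shard is subscribed to by exactly one node here (round-robin);
# re-subscribing shards is how the cluster can grow or shrink elastically.
subscriptions = {node: {s for s in range(NUM_SHARDS) if s % len(NODES) == i}
                 for i, node in enumerate(NODES)}

# Both tables are segmented on customer id, so matching rows share a shard.
orders    = [("a", 10), ("b", 5), ("c", 7), ("a", 3)]      # (customer_id, amount)
customers = {"a": "Alice", "b": "Bob", "c": "Carol"}        # customer_id -> name

def local_join(node):
    owned = subscriptions[node]
    # A node only reads the shards it subscribes to from shared storage,
    # and the join completes without shuffling rows between nodes.
    return [(customers[cid], amount) for cid, amount in orders
            if shard_of(cid) in owned]

print({node: local_join(node) for node in NODES})
```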
Citations: 14
Worst Case Optimal Joins on Relational and XML data
Pub Date : 2018-05-27 DOI: 10.1145/3183713.3183721
Yuxing Chen
In today's data management ecosystem, one of the greatest challenges is data variety. Data comes in multiple formats, such as relational and (semi-)structured data. A traditional database handles a single data format, and thus its ability to deal with different types of data is limited. To overcome this limitation, we propose a multi-model processing framework for relational and semi-structured data (i.e., XML), and design a worst-case optimal join algorithm. The salient feature of our algorithm is that it guarantees the intermediate results are no larger than the worst-case join results. Preliminary results show that our multi-model algorithm significantly outperforms the baseline join methods in terms of running time and intermediate result size.
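For intuition, the sketch below runs a generic attribute-at-a-time join, in the spirit of worst-case optimal join algorithms, for the triangle query Q(a,b,c) = R(a,b), S(b,c), T(a,c): each variable is bound by intersecting the candidate sets from the relations that mention it. It illustrates the general technique only, not the paper's relational/XML algorithm.

```python
from collections import defaultdict

def triangle_join(R, S, T):
    """Enumerate all (a, b, c) with (a,b) in R, (b,c) in S, (a,c) in T."""
    R_by_a, S_by_b, T_by_a = defaultdict(set), defaultdict(set), defaultdict(set)
    for a, b in R: R_by_a[a].add(b)
    for b, c in S: S_by_b[b].add(c)
    for a, c in T: T_by_a[a].add(c)

    out = []
    for a in set(R_by_a) & set(T_by_a):            # bind a: must appear in R and T
        for b in R_by_a[a] & set(S_by_b):          # bind b: consistent with R and S
            for c in S_by_b[b] & T_by_a[a]:        # bind c: intersect both candidate sets
                out.append((a, b, c))
    return out

R = [(1, 2), (2, 3)]
S = [(2, 3), (3, 4)]
T = [(1, 3), (2, 1)]
print(triangle_join(R, S, T))   # -> [(1, 2, 3)]
```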
Citations: 3
Meta-Dataflows: Efficient Exploratory Dataflow Jobs
Pub Date : 2018-05-27 DOI: 10.1145/3183713.3183760
R. Fernandez, W. Culhane, Pijika Watcharapichat, M. Weidlich, V. Morales, P. Pietzuch
Distributed dataflow systems such as Apache Spark and Apache Flink are used to derive new insights from large datasets. While they efficiently execute concrete data processing workflows, expressed as dataflow graphs, they lack generic support for exploratory workflows: if a user is uncertain about the correct processing pipeline, e.g. in terms of data cleaning strategy or choice of model parameters, they must repeatedly submit modified jobs to the system. This, however, misses out on optimisation opportunities for exploratory workflows, both in terms of scheduling and memory allocation. We describe meta-dataflows (MDFs), a new model to effectively express exploratory workflows and efficiently execute them on compute clusters. With MDFs, users specify a family of dataflows using two primitives: (a) an explore operator automatically considers choices in a dataflow; and (b) a choose operator assesses the result quality of explored dataflow branches and selects a subset of the results. We propose optimisations to execute MDFs: a system can (i) avoid redundant computation when exploring branches by reusing intermediate results, discarding results from underperforming branches, and pruning unnecessary branches; and (ii) consider future data access patterns in the MDF when allocating cluster memory. Our evaluation shows that MDFs improve the runtime of exploratory workflows by up to 90% compared to sequential job execution.
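As a rough illustration of the explore and choose primitives, the sketch below fans a toy pipeline out over data-cleaning and parameter choices, memoises the shared upstream stage so it is computed once across branches, and keeps only the top-scoring branches. All function and parameter names are invented for illustration; this is not the MDF API.

```python
from functools import lru_cache
from itertools import product

@lru_cache(maxsize=None)                     # shared upstream stage, computed once per choice
def preprocess(clean_strategy):
    data = list(range(20))
    return tuple(data) if clean_strategy == "keep_all" else \
           tuple(x for x in data if x % 3 != 0)

def train(data, learning_rate):
    # Stand-in quality metric: peaks at learning_rate 0.1 and rewards more data.
    return 1.0 - abs(learning_rate - 0.1) - 1.0 / (len(data) + 1)

def explore(clean_strategies, learning_rates):
    """Fan the dataflow out over every combination of choices."""
    return [((strategy, lr), train(preprocess(strategy), lr))
            for strategy, lr in product(clean_strategies, learning_rates)]

def choose(branches, k=1):
    """Keep only the k best-scoring branches."""
    return sorted(branches, key=lambda branch: branch[1], reverse=True)[:k]

best = choose(explore(["keep_all", "drop_multiples_of_3"], [0.01, 0.1, 0.5]), k=2)
print(best)   # the two highest-scoring (strategy, learning_rate) configurations
```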
Citations: 7