Emma in Action: Declarative Dataflows for Scalable Data Analysis

Proceedings of the 2016 International Conference on Management of Data. Pub Date: 2016-06-26. DOI: 10.1145/2882903.2899396

Alexander B. Alexandrov, Andreas Salzmann, Georgi Krastev, Asterios Katsifodimos, V. Markl


Citations: 10

Abstract

Parallel dataflow APIs based on second-order functions were originally seen as a flexible alternative to SQL. Over time, however, their complexity increased because the underlying engines had to expose a growing number of physical aspects in order to facilitate efficient execution. To retain a sufficient level of abstraction and lower the barrier to entry for data scientists, projects like Spark and Flink currently offer domain-specific APIs on top of their parallel collection abstractions. This demonstration highlights the benefits of an alternative design based on deep language embedding. We showcase Emma - a programming language embedded in Scala. Emma promotes parallel collection processing through native constructs like Scala's for-comprehensions - a declarative syntax akin to SQL. Emma also advocates quasi-quoting the entire data analysis algorithm rather than its individual dataflow expressions. This allows for decomposing the quoted code into (sequential) control flow and (parallel) dataflow fragments, optimizing the dataflows in context, and transparently offloading them to an engine like Spark or Flink. The proposed design promises increased programmer productivity by avoiding an impedance mismatch, thereby reducing the lag times and cost of data analysis.
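To make the comparison between for-comprehensions and SQL concrete, here is a minimal sketch that uses only plain Scala standard-library collections. It is not Emma's API and is not taken from the paper. The `Customer` and `Order` types, the sample data, and the `ComprehensionSketch` object are hypothetical names introduced purely for illustration. The sketch contrasts a declarative for-comprehension with an equivalent hand-written chain of second-order functions (`flatMap`, `withFilter`, `map`).

```scala
// Plain Scala collections only; this is NOT Emma's API.
// Hypothetical record types, introduced for illustration.
final case class Customer(id: Int, name: String)
final case class Order(customerId: Int, amount: Double)

object ComprehensionSketch {
  // Hypothetical sample data.
  val customers: Seq[Customer] = Seq(Customer(1, "Ada"), Customer(2, "Bob"))
  val orders: Seq[Order] = Seq(Order(1, 20.0), Order(1, 5.0), Order(2, 7.5))

  // Declarative form. It reads roughly like:
  //   SELECT c.name, o.amount
  //   FROM customers c, orders o
  //   WHERE c.id = o.customerId AND o.amount > 6
  val declarative: Seq[(String, Double)] =
    for {
      c <- customers
      o <- orders
      if c.id == o.customerId
      if o.amount > 6.0
    } yield (c.name, o.amount)

  // The same query written directly as a chain of second-order functions.
  val desugared: Seq[(String, Double)] =
    customers.flatMap { c =>
      orders
        .withFilter(o => c.id == o.customerId)
        .withFilter(o => o.amount > 6.0)
        .map(o => (c.name, o.amount))
    }
}
```

Both values hold the same pairs. The first form states what is wanted (a join with two predicates), while the second fixes a particular nesting of operators, which is the kind of detail the abstract argues should be left to an optimizer.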
