Big Provenance Stream Processing for Data Intensive Computations

2018 IEEE 14th International Conference on e-Science (e-Science). Pub Date: 2018-10-01. DOI: 10.1109/eScience.2018.00039

Isuru Suriarachchi, S. Withana, Beth Plale

Citations: 10

Abstract

In today's business and research landscape, data analysis consumes public and proprietary data from numerous sources and uses one or more popular data-parallel frameworks such as Hadoop, Spark, and Flink. In a Data Lake setting, these frameworks co-exist. Our earlier work has shown that data provenance in Data Lakes can aid both traceability and management. The sheer volume of fine-grained provenance generated in a multi-framework application motivates on-the-fly provenance processing. We introduce a new parallel stream processing algorithm that reduces fine-grained provenance while preserving backward and forward provenance. The algorithm is resilient to provenance events arriving out of order. It is evaluated using several strategies for partitioning a provenance stream. The evaluation shows that the parallel algorithm performs well in processing out-of-order provenance streams, with good scalability and accuracy.
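To make the idea of order-insensitive provenance reduction concrete, here is a minimal single-threaded sketch in Python. It is not the algorithm from the paper, which the abstract does not specify; it only illustrates one simple way to collapse fine-grained derivation edges into input-to-output links while still answering backward and forward queries when edges arrive in arbitrary order. The class `ProvenanceReducer`, its methods, and the node names are all hypothetical, and the sketch does no partitioning or parallelism.

```python
from collections import defaultdict


class ProvenanceReducer:
    """Illustrative sketch (hypothetical, not the paper's algorithm).

    Consumes edges "target was derived from source" in any order and
    maintains, for every node, the set of original inputs it derives from.
    """

    def __init__(self, inputs, outputs):
        self.inputs = set(inputs)
        self.outputs = set(outputs)
        self.children = defaultdict(set)  # node -> nodes derived from it
        self.origins = defaultdict(set)   # node -> inputs it derives from
        for node in self.inputs:
            self.origins[node].add(node)

    def add_edge(self, source, target):
        # Remember the edge so origins learned later can still flow through it.
        self.children[source].add(target)
        self._propagate(target, set(self.origins[source]))

    def _propagate(self, node, new_origins):
        # Push only newly learned origins downstream; this also ends on cycles.
        stack = [(node, new_origins)]
        while stack:
            current, candidates = stack.pop()
            fresh = candidates - self.origins[current]
            if not fresh:
                continue
            self.origins[current] |= fresh
            for child in self.children[current]:
                stack.append((child, fresh))

    def backward(self, output):
        """Inputs that contributed to the given output."""
        return set(self.origins[output])

    def forward(self, input_id):
        """Outputs that the given input contributed to."""
        return {o for o in self.outputs if input_id in self.origins[o]}

    def reduced(self):
        """Reduced provenance: (input, output) pairs without intermediates."""
        return {(i, o) for o in self.outputs for i in self.origins[o]}


# Edges arrive out of order: the last step of the pipeline is seen first.
reducer = ProvenanceReducer(inputs={"in1", "in2"}, outputs={"out1"})
for src, dst in [("b", "out1"), ("a", "b"), ("in2", "b"), ("in1", "a")]:
    reducer.add_edge(src, dst)
print(sorted(reducer.backward("out1")))
```

The sketch keeps intermediate nodes in memory so that late edges can still be connected; only the `reduced()` view drops them. A streaming system would also need a policy for when intermediate state can be discarded, which is outside the scope of this illustration.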

