Efficient fault-tolerance for iterative graph processing on distributed dataflow systems

2016 IEEE 32nd International Conference on Data Engineering (ICDE). Pub Date: 2016-05-16. DOI: 10.1109/ICDE.2016.7498275

Chen Xu, M. Holzemer, Manohar Kaul, V. Markl

Pages: 613-624

Citations: 19

Abstract

Real-world graph processing applications often require combining graph data with tabular data. Moreover, graph processing is usually part of a larger analytics workflow consisting of data preparation, analysis and model building, and model application. General-purpose distributed dataflow frameworks execute all steps of such workflows holistically. This holistic view enables these systems to reason about and automatically optimize the processing. Most big graph processing algorithms are iterative and incur a long runtime, as they require multiple passes over the data until convergence. Thus, fault tolerance and quick recovery from any intermittent failure at any step of the workflow are crucial for effective and efficient analysis. In this work, we propose a novel fault-tolerance mechanism for iterative graph processing on distributed dataflow systems with the objective of reducing the checkpointing cost and failure recovery time. Rather than writing checkpoints that block downstream operators, our mechanism writes checkpoints in an unblocking manner, without breaking pipelined tasks. In contrast to the typical unblocking checkpointing approaches (i.e., managing checkpoints independently for immutable datasets), we inject the checkpoints of mutable datasets into the iterative dataflow itself. Hence, our mechanism is iteration-aware by design. This simplifies the system architecture and facilitates coordinating the checkpoint creation during iterative graph processing. We achieve speedier recovery, i.e., confined recovery, by using the local log files on each node to avoid a complete re-computation from scratch. Our theoretical studies as well as our experimental analysis on Flink give further insight into our fault-tolerance strategies and show that they are more efficient than blocking checkpointing and complete recovery for iterative graph processing on dataflow systems.
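To make the idea of confined recovery more concrete, the following is a minimal, single-process sketch and not the mechanism from the paper. It assumes a toy min-label propagation (connected components) job, a hypothetical `owner` partitioning function, in-memory dictionaries standing in for checkpoints and per-node log files, and a simulated failure. Checkpoints here are written synchronously, so the sketch does not model the unblocking, dataflow-injected checkpoints described above; it only illustrates how a failed worker might replay logged messages from its last checkpoint while healthy workers keep their state.

```python
from collections import defaultdict

# Toy undirected graph: a path 0-1-...-7 plus a separate edge 8-9.
EDGES = [(i, i + 1) for i in range(7)] + [(8, 9)]
NUM_WORKERS = 2

ADJ = defaultdict(set)
for a, b in EDGES:
    ADJ[a].add(b)
    ADJ[b].add(a)
VERTICES = sorted(ADJ)


def owner(v):
    """Hypothetical partitioning: vertex v lives on worker v mod NUM_WORKERS."""
    return v % NUM_WORKERS


def outgoing(labels):
    """Messages (dst, label) sent by the vertices whose labels are given."""
    return [(dst, labels[v]) for v in labels for dst in ADJ[v]]


def apply_messages(worker, labels, messages):
    """Min-label update, restricted to the vertices owned by `worker`."""
    new = dict(labels)
    for dst, label in messages:
        if owner(dst) == worker:
            new[dst] = min(new[dst], label)
    return new


def run(supersteps, checkpoint_every, fail_worker=None, fail_after=None):
    """Run the job; optionally crash `fail_worker` after superstep `fail_after`.

    Returns (final labels, number of supersteps replayed by the failed worker).
    """
    workers = range(NUM_WORKERS)
    state = {w: {v: v for v in VERTICES if owner(v) == w} for w in workers}
    checkpoints = {w: (0, dict(state[w])) for w in workers}
    logs = {w: {} for w in workers}  # logs[w][s]: messages w sent in superstep s
    replayed = 0

    for s in range(1, supersteps + 1):
        for w in workers:
            logs[w][s] = outgoing(state[w])  # stand-in for a local log file
        inbox = [m for w in workers for m in logs[w][s]]
        state = {w: apply_messages(w, state[w], inbox) for w in workers}
        if s % checkpoint_every == 0:
            checkpoints = {w: (s, dict(state[w])) for w in workers}

        if fail_worker is not None and s == fail_after:
            state[fail_worker] = None  # simulated crash: in-memory state lost
            last, saved = checkpoints[fail_worker]
            labels = dict(saved)
            # Only the failed worker replays; healthy workers are untouched.
            for t in range(last + 1, s + 1):
                own = outgoing(labels)
                remote = [m for w in workers if w != fail_worker
                          for m in logs[w][t]]
                labels = apply_messages(fail_worker, labels, own + remote)
                replayed += 1
            state[fail_worker] = labels

    merged = {v: label for w in workers for v, label in state[w].items()}
    return merged, replayed
```

Calling `run(8, 3)` and `run(8, 3, fail_worker=1, fail_after=5)` should yield the same labels, with the failed worker replaying only the supersteps after its last checkpoint rather than restarting the whole job.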


