Lineage stash: fault tolerance off the critical path

Proceedings of the 27th ACM Symposium on Operating Systems Principles
Pub Date: 2019-10-27
DOI: 10.1145/3341301.3359653

Stephanie Wang, J. Liagouris, Robert Nishihara, Philipp Moritz, Ujval Misra, Alexey Tumanov, I. Stoica

Citations: 37

Abstract

As cluster computing frameworks such as Spark, Dryad, Flink, and Ray are being deployed in mission-critical applications and on larger and larger clusters, their ability to tolerate failures is growing in importance. These frameworks employ two broad approaches for fault tolerance: checkpointing and lineage. Checkpointing exhibits low overhead during normal operation but high overhead during recovery, while lineage-based solutions make the opposite tradeoff. We propose the lineage stash, a decentralized causal logging technique that significantly reduces the runtime overhead of lineage-based approaches without impacting recovery efficiency. With the lineage stash, instead of recording the task's information before the task is executed, we record it asynchronously and forward the lineage along with the task. This makes it possible to support large-scale, low-latency (millisecond-level) data processing applications with low runtime and recovery overheads. Experimental results for applications in distributed training and stream processing show that the lineage stash provides task execution latencies similar to checkpointing alone, while incurring a recovery overhead as low as traditional lineage-based approaches.
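
The core idea in the abstract (log lineage asynchronously and forward the not-yet-durable lineage with the task) can be illustrated with a toy model. This is only a sketch under stated assumptions, not the authors' implementation and not the API of Ray or any other framework: `Node`, `submit`, `flush`, and `find_lineage` are hypothetical names, the durable log is a plain dictionary, and a node failure is simulated by clearing its in-memory stash.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Hypothetical lineage record: (task id, ids of the tasks it depends on).
Record = Tuple[str, Tuple[str, ...]]


@dataclass
class Node:
    """A worker with an in-memory stash of lineage that is not yet durable."""
    name: str
    stash: Dict[str, Record] = field(default_factory=dict)


def submit(sender: Node, receiver: Node, task_id: str,
           deps: Tuple[str, ...] = ()) -> None:
    """Send a task without waiting for its lineage to be written durably."""
    # Record the task locally instead of blocking on the durable log.
    sender.stash[task_id] = (task_id, deps)
    # Forward all uncommitted lineage along with the task, so the
    # receiver can still describe the task if the sender fails.
    receiver.stash.update(sender.stash)


def flush(node: Node, durable_log: Dict[str, Record]) -> None:
    """Write the stash to the durable log; in a real system this would
    run in the background, off the task's critical path."""
    durable_log.update(node.stash)
    node.stash.clear()


def find_lineage(task_id: str, durable_log: Dict[str, Record],
                 survivors: List[Node]) -> Optional[Record]:
    """On recovery, look in the durable log first, then in surviving stashes."""
    if task_id in durable_log:
        return durable_log[task_id]
    for node in survivors:
        if task_id in node.stash:
            return node.stash[task_id]
    return None
```

In this sketch, if node `a` submits tasks to node `b` and then loses its memory before any flush, `find_lineage` can still recover the records from `b`'s stash. That is the property that lets the durable write happen after, rather than before, task execution.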
