Zorro: zero-cost reactive failure recovery in distributed graph processing

Proceedings of the Sixth ACM Symposium on Cloud Computing. Pub Date: 2015-08-27. DOI: 10.1145/2806777.2806934

Mayank Pundir, Luke M. Leslie, Indranil Gupta, R. Campbell


Citations: 30

Abstract

Distributed graph processing systems largely rely on proactive techniques for failure recovery. Unfortunately, these approaches (such as checkpointing) entail significant overhead. In this paper, we argue that distributed graph processing systems should instead use a reactive approach to failure recovery. The reactive approach gives up some completeness of the result (generating a slightly inaccurate result) in exchange for reducing the overhead during failure-free execution to zero. We build a system called Zorro that embodies this reactive approach, and integrate Zorro into two graph processing systems, PowerGraph and LFGraph. When a failure occurs, Zorro opportunistically exploits the vertex replication inherent in today's graph processing systems to quickly rebuild the state of failed servers. Experiments using real-world graphs demonstrate that Zorro is able to recover over 99% of the graph state when 6-12% of the servers fail, and 87-95% when half the cluster fails. Furthermore, across various graph processing algorithms, Zorro incurs little to no accuracy loss in all experimental failure scenarios, and achieves a worst-case accuracy of 97%.
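The replica-based recovery idea in the abstract can be illustrated with a small sketch. This is not Zorro's implementation: the in-memory server layout, the `primary` and `replicas` keys, and the function names `recover_failed_state` and `recovered_fraction` are all hypothetical, chosen only to show how vertex state held as replicas on surviving servers might be used to rebuild the state of failed servers, and why vertices with no surviving replica are lost.

```python
from typing import Dict, Set, Tuple

# Hypothetical layout (an assumption, not the paper's data structures):
# each server stores the values of the vertices it owns ("primary") and
# copies of vertices owned by other servers ("replicas").
Server = Dict[str, Dict[int, float]]


def recover_failed_state(
    servers: Dict[str, Server], failed: Set[str]
) -> Tuple[Dict[int, float], Set[int]]:
    """Rebuild the primary state of failed servers from surviving replicas.

    Returns (recovered vertex values, vertex ids with no surviving replica).
    """
    survivors = [s for sid, s in servers.items() if sid not in failed]
    recovered: Dict[int, float] = {}
    lost: Set[int] = set()
    for sid in failed:
        # In this simulation we read the failed server's vertex ids directly;
        # a real system would derive ownership from its partitioning scheme.
        for vid in servers[sid]["primary"]:
            # Take the value from the first surviving server holding a replica.
            value = next(
                (s["replicas"][vid] for s in survivors if vid in s["replicas"]),
                None,
            )
            if value is None:
                lost.add(vid)  # no replica survived; state must be reinitialized
            else:
                recovered[vid] = value
    return recovered, lost


def recovered_fraction(recovered: Dict[int, float], lost: Set[int]) -> float:
    """Fraction of the failed servers' vertices whose state was rebuilt."""
    total = len(recovered) + len(lost)
    return len(recovered) / total if total else 1.0


# Toy cluster of three servers with made-up vertex values.
cluster: Dict[str, Server] = {
    "s0": {"primary": {1: 0.1, 2: 0.2}, "replicas": {3: 0.3}},
    "s1": {"primary": {3: 0.3, 4: 0.4}, "replicas": {1: 0.1}},
    "s2": {"primary": {5: 0.5}, "replicas": {4: 0.4, 2: 0.2}},
}

one_failure = recover_failed_state(cluster, {"s0"})
two_failures = recover_failed_state(cluster, {"s0", "s2"})
```

In this toy cluster, losing only `s0` leaves replicas of both of its vertices on the survivors, while losing `s0` and `s2` together leaves some vertices without any surviving copy. That mirrors, in miniature, the abstract's observation that the recoverable fraction drops as more of the cluster fails.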
