Mayank Pundir, Luke M. Leslie, Indranil Gupta, R. Campbell
{"title":"Zorro: zero-cost reactive failure recovery in distributed graph processing","authors":"Mayank Pundir, Luke M. Leslie, Indranil Gupta, R. Campbell","doi":"10.1145/2806777.2806934","DOIUrl":null,"url":null,"abstract":"Distributed graph processing systems largely rely on proactive techniques for failure recovery. Unfortunately, these approaches (such as checkpointing) entail a significant overhead. In this paper, we argue that distributed graph processing systems should instead use a reactive approach to failure recovery. The reactive approach trades off completeness of the result (generating a slightly inaccurate result) while reducing the overhead during failure-free execution to zero. We build a system called Zorro that imbues this reactive approach, and integrate Zorro into two graph processing systems -- PowerGraph and LFGraph. When a failure occurs, Zorro opportunistically exploits vertex replication inherent in today's graph processing systems to quickly rebuild the state of failed servers. Experiments using real-world graphs demonstrate that Zorro is able to recover over 99% of the graph state when 6--12% of the servers fail, and between 87--95% when half the cluster fails. Furthermore, using various graph processing algorithms, Zorro incurs little to no accuracy loss in all experimental failure scenarios, and achieves a worst-case accuracy of 97%.","PeriodicalId":275158,"journal":{"name":"Proceedings of the Sixth ACM Symposium on Cloud Computing","volume":"40 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"30","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the Sixth ACM Symposium on Cloud Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2806777.2806934","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 30
Abstract
Distributed graph processing systems largely rely on proactive techniques for failure recovery. Unfortunately, these approaches (such as checkpointing) entail a significant overhead. In this paper, we argue that distributed graph processing systems should instead use a reactive approach to failure recovery. The reactive approach trades off completeness of the result (generating a slightly inaccurate result) while reducing the overhead during failure-free execution to zero. We build a system called Zorro that imbues this reactive approach, and integrate Zorro into two graph processing systems -- PowerGraph and LFGraph. When a failure occurs, Zorro opportunistically exploits vertex replication inherent in today's graph processing systems to quickly rebuild the state of failed servers. Experiments using real-world graphs demonstrate that Zorro is able to recover over 99% of the graph state when 6--12% of the servers fail, and between 87--95% when half the cluster fails. Furthermore, using various graph processing algorithms, Zorro incurs little to no accuracy loss in all experimental failure scenarios, and achieves a worst-case accuracy of 97%.