Transparent Checkpoint-Restart of Distributed Applications on Commodity Clusters

Oren Laadan, Dan B. Phung, Jason Nieh
Published in: 2005 IEEE International Conference on Cluster Computing
Publication date: 2005-09-01
DOI: 10.1109/CLUSTR.2005.347039 (https://doi.org/10.1109/CLUSTR.2005.347039)
Citations: 48

Abstract

We have created ZapC, a novel system for transparent coordinated checkpoint-restart of distributed network applications on commodity clusters. ZapC provides a thin virtualization layer on top of the operating system that decouples a distributed application from dependencies on the cluster nodes on which it is executing. This decoupling enables ZapC to checkpoint an entire distributed application across all nodes in a coordinated manner such that it can be restarted from the checkpoint on a different set of cluster nodes at a later time. ZapC checkpoint-restart operations execute in parallel across different cluster nodes, providing faster checkpoint-restart performance. ZapC uniquely supports network state in a transport-protocol-independent manner, including correctly saving and restoring socket and protocol state for both TCP and UDP connections. We have implemented a ZapC Linux prototype and demonstrate that it provides low virtualization overhead and fast checkpoint-restart times for distributed network applications without any application, library, kernel, or network protocol modifications.
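
The coordinated, parallel checkpoint and the restart on a different set of nodes can be illustrated with a toy sketch. This is not ZapC's implementation or API: the `Node` class, the in-memory "images", and the three-phase ordering (quiesce the network, save, resume) are assumptions made purely for illustration, and threads stand in for real cluster nodes. Real socket and protocol state is not modeled.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
import copy


@dataclass
class Node:
    """Hypothetical stand-in for one cluster node running part of the application."""
    name: str
    app_state: dict = field(default_factory=dict)
    network_blocked: bool = False

    def block_network(self):
        # Stop traffic so that no message is in flight while state is saved.
        self.network_blocked = True

    def unblock_network(self):
        self.network_blocked = False

    def save(self):
        # The checkpoint image is just a deep copy of the local state here.
        assert self.network_blocked, "save only after the network is quiesced"
        return copy.deepcopy(self.app_state)

    def load(self, image):
        self.app_state = copy.deepcopy(image)


def on_all(nodes, action):
    """Run action(node) on every node in parallel; results keep node order."""
    with ThreadPoolExecutor(max_workers=max(1, len(nodes))) as pool:
        return list(pool.map(action, nodes))


def coordinated_checkpoint(nodes):
    """Quiesce every node, save all nodes in parallel, then resume."""
    on_all(nodes, Node.block_network)
    images = on_all(nodes, Node.save)
    on_all(nodes, Node.unblock_network)
    return images


def coordinated_restart(images, new_nodes):
    """Restore each image onto a (possibly different) node, then resume."""
    if len(images) != len(new_nodes):
        raise ValueError("need one target node per checkpoint image")
    on_all(new_nodes, Node.block_network)
    pairs = list(zip(new_nodes, images))
    on_all(pairs, lambda pair: pair[0].load(pair[1]))
    on_all(new_nodes, Node.unblock_network)


# Usage: checkpoint on nodes a and b, restart on nodes c and d.
old_nodes = [Node("a", {"rank": 0, "step": 3}), Node("b", {"rank": 1, "step": 3})]
images = coordinated_checkpoint(old_nodes)
new_nodes = [Node("c"), Node("d")]
coordinated_restart(images, new_nodes)
```

Blocking the network on every node before any node saves is what makes the per-node images mutually consistent in this sketch; the saves themselves are independent and can therefore run in parallel.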