Application checkpointing in grid environment with improved checkpoint reliability through replication

R. K. Bawa, R. Singh
2012 Third International Conference on Computing, Communication and Networking Technologies (ICCCNT'12)
Published: 2012-07-26
DOI: https://doi.org/10.1109/ICCCNT.2012.6395974
Citations: 2

Abstract

Grid technologies are emerging as the next generation of distributed computing, allowing the aggregation of heterogeneous, geographically distributed resources. The heterogeneous nature of the grid makes it more vulnerable to faults, which lead either to job failure or to delays in completing a job. Checkpointing is one of the many fault tolerance techniques used to make the Grid more efficient and reliable. In this paper we have developed an application-checkpointing-based fault tolerance technique for an Alchemi-based Grid environment. In this technique, application threads generate their checkpoints and store them in the checkpoint table at the manager node. If a thread fails, the checkpoint of the corresponding thread is used to resume execution from the point of failure. This technique introduces a slight overhead in fault-free situations but is very effective in case of a node failure. Increasing the checkpoint frequency improves a job's ability to resume but also increases the overhead of generating and storing checkpoints, which increases the job's processing time.
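
The mechanism described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' Alchemi (.NET) implementation. The names `CheckpointTable` and `run_thread`, the two-replica in-memory table, and the summation workload are all assumptions made for illustration. In this sketch a restarted thread resumes from its last stored checkpoint, so any work done after that checkpoint is repeated.

```python
import copy


class CheckpointTable:
    """Hypothetical manager-side table: thread id -> latest checkpoint.

    Each checkpoint is written to every replica (an assumed replication
    scheme), so losing one replica does not lose the checkpoint.
    """

    def __init__(self, replicas=2):
        self.replicas = [dict() for _ in range(replicas)]

    def store(self, thread_id, state):
        for replica in self.replicas:
            replica[thread_id] = copy.deepcopy(state)

    def load(self, thread_id):
        # Return the first surviving copy, or None if no checkpoint exists.
        for replica in self.replicas:
            if thread_id in replica:
                return copy.deepcopy(replica[thread_id])
        return None


def run_thread(thread_id, n_items, table, interval, fail_at=None):
    """Sum the integers 0..n_items-1, checkpointing every `interval` items.

    If `fail_at` is set, a RuntimeError is raised just before that item
    is processed, simulating a node failure.
    """
    state = table.load(thread_id) or {"next": 0, "total": 0}
    i, total = state["next"], state["total"]
    while i < n_items:
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated node failure")
        total += i
        i += 1
        if i % interval == 0:
            table.store(thread_id, {"next": i, "total": total})
    return total


if __name__ == "__main__":
    table = CheckpointTable(replicas=2)
    try:
        run_thread("t1", 10, table, interval=3, fail_at=7)
    except RuntimeError:
        pass
    table.replicas[0].clear()  # simulate losing one replica
    print(run_thread("t1", 10, table, interval=3))
```

A smaller `interval` means less repeated work after a failure but more calls to `store`, which is the frequency versus overhead trade-off the abstract describes.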