MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes

G. Bosilca, Aurélien Bouteiller, F. Cappello, Samir Djilali, G. Fedak, C. Germain, T. Hérault, Pierre Lemarinier, O. Lodygensky, F. Magniette, V. Néri, A. Selikhov
Published in: ACM/IEEE SC 2002 Conference (SC'02), 2002-11-16
DOI: 10.1109/SC.2002.10048 (https://doi.org/10.1109/SC.2002.10048)
Citations: 344

Abstract

Global Computing platforms, large-scale clusters and future TeraGRID systems gather thousands of nodes for computing parallel scientific applications. At this scale, node failures or disconnections are frequent events. This Volatility reduces the MTBF of the whole system to the range of hours or minutes. We present MPICH-V, an automatic Volatility-tolerant MPI environment based on uncoordinated checkpoint/roll-back and distributed message logging. The MPICH-V architecture relies on Channel Memories, Checkpoint Servers and theoretically proven protocols to execute existing or new SPMD and Master-Worker MPI applications on volatile nodes. To evaluate its capabilities, we run MPICH-V within a framework in which the number of nodes, Channel Memories and Checkpoint Servers, as well as the node Volatility, can be completely configured. We present a detailed performance evaluation of every component of MPICH-V and of its global performance for non-trivial parallel applications. Experimental results demonstrate good scalability and high tolerance to node Volatility.
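
The combination of uncoordinated checkpointing and message logging named in the abstract can be illustrated with a toy model. The following Python sketch is not MPICH-V code: the class names `ChannelMemory` and `Node`, their methods, and the running-sum "application state" are all assumptions made for illustration. It only shows the general idea that if every message is logged in a reliable store, a failed node can restart alone from its own checkpoint and replay the logged messages, while the other nodes keep running.

```python
from collections import defaultdict


class ChannelMemory:
    """Hypothetical reliable message store (illustrative, not the MPICH-V API)."""

    def __init__(self):
        # Per-receiver log of (sender, payload) pairs, kept in delivery order.
        self._log = defaultdict(list)

    def put(self, sender, receiver, payload):
        # Every message is logged here instead of going directly to the peer.
        self._log[receiver].append((sender, payload))

    def get(self, receiver, index):
        # Return the index-th message logged for `receiver`, or None if absent.
        log = self._log[receiver]
        return log[index] if index < len(log) else None


class Node:
    """Toy volatile node whose state is the running sum of received payloads."""

    def __init__(self, rank, cm):
        self.rank = rank
        self.cm = cm
        self.received = 0  # number of messages consumed so far
        self.state = 0     # toy application state

    def send(self, dest, payload):
        self.cm.put(self.rank, dest, payload)

    def recv(self):
        msg = self.cm.get(self.rank, self.received)
        if msg is None:
            return None
        self.received += 1
        self.state += msg[1]
        return msg

    def checkpoint(self):
        # Uncoordinated: taken locally, without synchronizing with other nodes.
        return (self.received, self.state)

    @classmethod
    def restart(cls, rank, cm, ckpt):
        # Restore a fresh node from a saved checkpoint.
        node = cls(rank, cm)
        node.received, node.state = ckpt
        return node


def demo():
    cm = ChannelMemory()
    a, b = Node(0, cm), Node(1, cm)
    for value in (1, 2, 3):
        a.send(1, value)
    b.recv()                  # consumes payload 1
    ckpt = b.checkpoint()     # saved after one message
    b.recv()
    b.recv()                  # consumes payloads 2 and 3
    state_before_failure = b.state
    # Node b is lost; a replacement restarts from the checkpoint and
    # replays the messages still held in the log. Node a is not rolled back.
    b2 = Node.restart(1, cm, ckpt)
    while b2.recv() is not None:
        pass
    return state_before_failure, b2.state


if __name__ == "__main__":
    print(demo())
```

In this toy run the restarted node reaches the same state as the node that failed, because the log returns the same messages in the same order. A real system also has to handle details omitted here, such as storing checkpoints on a separate server and garbage-collecting logged messages.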