Exploring the importance of high availability MPIs

Hakon O. Bugge
Proceedings of the 2006 ACM/IEEE conference on Supercomputing, published 2006-11-11
DOI: https://doi.org/10.1145/1188455.1188496
Citations: 0

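Abstract

Pharmaceutical research. Weather prediction. Oil exploration. The data analysis these jobs demand can be immense. As more applications run on Linux clusters, there are a growing number for which job completion is critical. Today's jobs are getting longer, and it is not unusual to come across jobs whose run times last multiple days. As the number of nodes in a cluster grows, the chance that a job is hit by a hardware-related failure before it completes becomes statistically significant. For such an application, the "cost" of having the job fail and having to restart it is enormous. You need efficient ways to drive jobs to completion or to recover from failures. This session will review the importance of high availability functionality in high performance computing MPIs when running communication-intensive applications. Different approaches to cooperative and distributed checkpoint-restart will also be explored.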
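The scaling argument in the abstract can be illustrated with a rough back-of-the-envelope model. The sketch below assumes that node failures are independent and exponentially distributed with a fixed per-node mean time between failures (MTBF). Neither this model nor the sample numbers (a 5-year MTBF, a 72-hour job) come from the source, and the function name `failure_free_probability` is hypothetical.

```python
import math


def failure_free_probability(nodes: int, job_hours: float, node_mtbf_hours: float) -> float:
    """Probability that no node fails while the job runs.

    Assumes independent, exponentially distributed node failures
    (an illustrative model, not taken from the source).
    """
    if nodes < 1 or job_hours < 0 or node_mtbf_hours <= 0:
        raise ValueError("need nodes >= 1, job_hours >= 0, node_mtbf_hours > 0")
    # One node survives the job with probability exp(-t / MTBF).
    # Independence multiplies the per-node terms: exp(-N * t / MTBF).
    return math.exp(-nodes * job_hours / node_mtbf_hours)


if __name__ == "__main__":
    mtbf = 5 * 365 * 24  # assumed per-node MTBF of 5 years, in hours
    for n in (16, 128, 1024, 8192):
        p = failure_free_probability(n, job_hours=72, node_mtbf_hours=mtbf)
        print(f"{n:5d} nodes: P(no failure in 72 h) = {p:.3f}")
```

Under these assumptions the failure-free probability decays exponentially with node count, which is the kind of effect that makes checkpoint-restart attractive for multi-day jobs on large clusters.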