Exploring the importance of high availability MPIs

Proceedings of the 2006 ACM/IEEE conference on Supercomputing. Pub Date: 2006-11-11. DOI: 10.1145/1188455.1188496

Hakon O. Bugge


Citations: 0

Abstract

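Pharmaceutical research. Weather prediction. Oil exploration. The data analysis these workloads demand can be enormous. As more applications move to Linux clusters, a growing number of them are jobs whose completion is critical. Jobs are also getting longer, and run times of multiple days are no longer unusual. As the number of nodes in a cluster grows, the probability that a job hits a hardware-related failure before it completes becomes statistically significant. For such an application, the "cost" of a failed job that has to be restarted from the beginning is enormous. You need efficient ways to drive jobs to completion or to recover from failures. This session will review the importance of high availability functionality in high performance computing MPIs when running communication-intensive applications. Different approaches to cooperative and distributed checkpoint/restart will also be explored.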
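The claim that failures become statistically significant as clusters grow can be illustrated with a back-of-the-envelope calculation. The sketch below is not taken from the session; it assumes that node failures are independent and exponentially distributed, and the function name, the 72-hour run time, and the 50,000-hour per-node MTBF are hypothetical values chosen only for illustration.

```python
import math


def survival_probability(nodes: int, run_hours: float, node_mtbf_hours: float) -> float:
    """Probability that no node fails during the run.

    Assumes independent, exponentially distributed node failures
    (an illustrative simplification, not a claim from the abstract).
    """
    # With N independent nodes the failure rates add up, so the
    # job-level failure rate is N / MTBF.
    return math.exp(-nodes * run_hours / node_mtbf_hours)


if __name__ == "__main__":
    # Hypothetical figures: a 72-hour job, per-node MTBF of 50,000 hours.
    for nodes in (16, 128, 1024):
        p = survival_probability(nodes, 72.0, 50_000.0)
        print(f"{nodes:5d} nodes: P(no failure) = {p:.3f}")
```

Under these assumptions the chance of an uninterrupted run falls exponentially with the node count, which is the motivation for checkpoint/restart: a checkpoint bounds the work lost to a failure instead of forcing a restart from the beginning.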


