Exploring the importance of high availability MPIs
Author: Hakon O. Bugge
Published in: Proceedings of the 2006 ACM/IEEE conference on Supercomputing, 2006-11-11
DOI: 10.1145/1188455.1188496 (https://doi.org/10.1145/1188455.1188496)
Pharmaceutical research. Weather prediction. Oil exploration. The data analysis these jobs demand can be immense. As more applications run on Linux clusters, a growing number of them are ones where job completion is critical. Jobs are also getting longer, and run times of multiple days are not unusual. As the number of nodes in a cluster grows, the likelihood that a hardware-related failure interrupts a job before it completes becomes statistically significant. For such an application, the cost of having the job fail and having to restart it is enormous. You need efficient ways to drive jobs to completion or to recover from failures. This session will review the importance of high-availability functionality in high-performance computing MPIs when running communication-intensive applications. Different approaches to cooperative and distributed checkpoint/restart will also be explored.
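The scaling argument in the abstract can be made concrete with a rough sketch. The abstract gives no failure model or figures, so everything here is an assumption for illustration: node failures are taken to be independent and exponentially distributed, and the function name `completion_probability`, the per-node MTBF of 50,000 hours, and the 72-hour run time are hypothetical.

```python
import math


def completion_probability(nodes: int, runtime_hours: float,
                           node_mtbf_hours: float) -> float:
    """Probability that no node fails during the run.

    Assumes independent, exponentially distributed node failures
    (an illustrative model, not one taken from the abstract).
    """
    # With N independent nodes, the combined failure rate is N / MTBF,
    # so the chance of surviving T hours is exp(-N * T / MTBF).
    return math.exp(-nodes * runtime_hours / node_mtbf_hours)


if __name__ == "__main__":
    # Hypothetical numbers: a 72-hour job and a 50,000-hour per-node MTBF.
    for nodes in (16, 128, 1024):
        p = completion_probability(nodes, runtime_hours=72.0,
                                   node_mtbf_hours=50_000.0)
        print(f"{nodes:5d} nodes: P(no failure in 72 h) = {p:.3f}")
```

Under these assumptions the chance of an uninterrupted run falls off exponentially with node count, which is why checkpoint/restart becomes more attractive as clusters and run times grow.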