Towards resilient EU HPC systems: a blueprint

Proceedings of the 16th ACM International Conference on Computing Frontiers. Pub Date: 2019-04-30. DOI: 10.1145/3310273.3323434

Petar Radojkovic


Citations: 7

Abstract

In high-performance computing (HPC), a single tightly coupled job may execute for days on thousands of servers. A server failure typically has cascading effects on the whole job, so redundancy and/or aggressive checkpointing is required to prevent the whole job from failing. This has an adverse impact on system performance and resource usage, which limits the ability to scale to larger systems. System resiliency is therefore one of the most important Exascale requirements and challenges.
