Scheduling a 100,000 Core Supercomputer for Maximum Utilization and Capability

2010 39th International Conference on Parallel Processing Workshops. Pub Date: 2010-09-13. DOI: 10.1109/ICPPW.2010.63

P. Andrews, P. Kovatch, Victor Hazlewood, Troy Baer


Citations: 6

Abstract

In late 2009, the National Institute for Computational Sciences (NICS) placed in production the world’s fastest academic supercomputer (third fastest overall), a Cray XT5 named Kraken, with almost 100,000 compute cores and a peak speed in excess of one petaflop. Delivering over 50% of the total cycles available to National Science Foundation users via the TeraGrid, Kraken has two missions that have historically proven difficult to reconcile: providing the maximum number of total cycles to the community while enabling full-machine runs for “hero” users. Historically, this has been attempted by allowing schedulers to choose when large jobs begin, with a concomitant reduction in utilization. At NICS, we used the results of a previous theoretical investigation to adopt a different approach, in which the system is forcibly “cleared out” on a weekly basis and the clearing is followed by consecutive full-machine runs. As our previous simulation results suggested, this led to a significant improvement in utilization, to over 90%. The difference in utilization between the traditional and adopted scheduling policies was equivalent to a 300+ teraflop supercomputer, or several million dollars of compute time per year.
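The arithmetic behind these claims can be illustrated with a back-of-envelope sketch. It is not the authors' simulation: the uniform spread of remaining runtimes, the 24-hour walltime limit, and the 30-point utilization gap used below are illustrative assumptions, and the function names are hypothetical. Only the roughly one-petaflop peak, the weekly drain, and the 300+ teraflop equivalence come from the abstract.

```python
HOURS_PER_WEEK = 7 * 24


def weekly_drain_loss(max_walltime_hours: float) -> float:
    """Fraction of a week's node-hours idled by one forced drain.

    Toy assumption (not from the paper): when the drain begins, the
    remaining runtimes of the running jobs are spread uniformly over
    [0, max_walltime_hours], so on average half of the machine is idle
    for the duration of the drain window.
    """
    idle_hours_equivalent = 0.5 * max_walltime_hours
    return idle_hours_equivalent / HOURS_PER_WEEK


def equivalent_capacity_tflops(peak_tflops: float, utilization_gap: float) -> float:
    """Peak size of a machine that would supply the recovered cycles."""
    return peak_tflops * utilization_gap


if __name__ == "__main__":
    # Hypothetical 24-hour walltime limit; one drain per week.
    loss = weekly_drain_loss(24.0)
    print(f"cycles lost to one weekly drain: {loss:.1%}")

    # Peak of about 1000 TFLOPS is from the abstract; the 0.30 gap is assumed.
    gain = equivalent_capacity_tflops(1000.0, 0.30)
    print(f"equivalent machine: about {gain:.0f} TFLOPS")
```

Under these assumptions a single weekly drain costs roughly 7% of the week's cycles, which is at least consistent with utilization above 90%, and a gap of about 30 percentage points on a one-petaflop machine corresponds to the 300 teraflop figure quoted above.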

