Optimizing the Resource and Job Management System of an Academic HPC & Research Computing Facility

S. Varrette, Emmanuel Kieffer, F. Pinel
{"title":"Optimizing the Resource and Job Management System of an Academic HPC & Research Computing Facility","authors":"S. Varrette, Emmanuel Kieffer, F. Pinel","doi":"10.1109/ISPDC55340.2022.00027","DOIUrl":null,"url":null,"abstract":"High Performance Computing (HPC) is nowadays a strategic asset required to sustain the surging demands for massive processing and data-analytic capabilities. In practice, the effective management of such large scale and distributed computing infrastructures is left to a Resource and Job Management System (RJMS). This essential middleware component is responsible for managing the computing resources, handling user requests to allocate resources while providing an optimized framework for starting, executing and monitoring jobs on the allocated resources. The University of Luxembourg has been operating for 15 years a large academic HPC facility which relies since 2017 on the Slurm RJMS introduced on top of the flagship cluster Iris. The acquisition of a new liquid-cooled supercomputer named Aion which was released in 2021 was the occasion to deeply review and optimize the seminal Slurm configuration, the resource limits defined and the sustaining fairsharing algorithm.This paper presents the outcomes of this study and details the implemented RJMS policy. The impact of the decisions made over the supercomputers workloads is also described. In particular, the performance evaluation conducted highlights that when compared to the seminal configuration, the described and implemented environment brought concrete and measurable improvements with regards the platform utilization (+12.64%), the jobs efficiency (as measured by the average Wall-time Request Accuracy, improved by 110.81%) or the management and funding (increased by 10%). The systems demonstrated sustainable and scalable HPC performances, and this effort has led to a negligible penalty on the average slowdown metric (response time normalized by runtime), which was increased by 0.59% for job workloads covering a complete year of exercise. Overall, this new setup has been in production for 18 months on both supercomputers and the updated model proves to bring a fairer and more satisfying experience to the end users. The proposed configurations and policies may help other HPC centres when designing or improving the RJMS sustaining the job scheduling strategy at the advent of computing capacity expansions.","PeriodicalId":389334,"journal":{"name":"2022 21st International Symposium on Parallel and Distributed Computing (ISPDC)","volume":"74 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 21st International Symposium on Parallel and Distributed Computing (ISPDC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISPDC55340.2022.00027","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

Abstract

High Performance Computing (HPC) is nowadays a strategic asset required to sustain the surging demands for massive processing and data-analytic capabilities. In practice, the effective management of such large-scale and distributed computing infrastructures is left to a Resource and Job Management System (RJMS). This essential middleware component is responsible for managing the computing resources and handling user requests to allocate resources, while providing an optimized framework for starting, executing and monitoring jobs on the allocated resources. The University of Luxembourg has been operating a large academic HPC facility for 15 years; since 2017, this facility has relied on the Slurm RJMS introduced on top of the flagship cluster Iris. The acquisition of a new liquid-cooled supercomputer named Aion, released in 2021, was the occasion to deeply review and optimize the original Slurm configuration, the defined resource limits and the underlying fair-share algorithm. This paper presents the outcomes of this study and details the implemented RJMS policy. The impact of these decisions on the supercomputers' workloads is also described. In particular, the performance evaluation conducted highlights that, when compared to the original configuration, the described and implemented environment brought concrete and measurable improvements with regard to platform utilization (+12.64%), job efficiency (as measured by the average Wall-time Request Accuracy, improved by 110.81%) and management and funding (increased by 10%). The systems demonstrated sustainable and scalable HPC performance, and this effort led to a negligible penalty on the average slowdown metric (response time normalized by runtime), which increased by only 0.59% for job workloads covering a complete year of operation. Overall, this new setup has been in production for 18 months on both supercomputers, and the updated model proves to bring a fairer and more satisfying experience to the end users. The proposed configurations and policies may help other HPC centres when designing or improving the RJMS sustaining their job scheduling strategy at the advent of computing capacity expansions.
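
For context, the two job-level metrics cited in the abstract can be computed directly from scheduler accounting records. The Python sketch below is a minimal illustration of how the average slowdown (response time normalized by runtime) and the average Wall-time Request Accuracy (actual runtime relative to the requested wall-time) might be derived from a workload of job records; it is not taken from the paper, the `JobRecord` fields are hypothetical, and the bounded-slowdown floor is a common convention rather than something specified in the abstract.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class JobRecord:
    """Minimal job accounting record (field names are illustrative)."""
    wait_s: float       # time spent waiting in the queue, in seconds
    runtime_s: float    # actual execution time, in seconds
    requested_s: float  # requested wall-time limit, in seconds


def slowdown(job: JobRecord, min_runtime_s: float = 10.0) -> float:
    """Bounded slowdown: response time (wait + run) normalized by runtime.

    The runtime in the denominator is floored at `min_runtime_s` and the
    result is floored at 1, so very short jobs do not dominate the average
    (a common convention, not specified in the abstract).
    """
    response = job.wait_s + job.runtime_s
    return max(response / max(job.runtime_s, min_runtime_s), 1.0)


def walltime_request_accuracy(job: JobRecord) -> float:
    """Fraction of the requested wall-time actually used, capped at 1."""
    return min(job.runtime_s / job.requested_s, 1.0)


def workload_summary(jobs: list[JobRecord]) -> dict[str, float]:
    """Average both metrics over a job workload (e.g. one year of records)."""
    return {
        "avg_slowdown": mean(slowdown(j) for j in jobs),
        "avg_walltime_request_accuracy": mean(walltime_request_accuracy(j) for j in jobs),
    }


if __name__ == "__main__":
    jobs = [
        JobRecord(wait_s=600, runtime_s=3600, requested_s=7200),
        JobRecord(wait_s=30, runtime_s=120, requested_s=3600),
    ]
    print(workload_summary(jobs))
```

Averaging these per-job values over a full year of accounting records yields figures comparable in spirit to the +110.81% Wall-time Request Accuracy and +0.59% slowdown changes reported above, assuming the paper uses the same conventional definitions of the two metrics.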