GRAP: Group-level Resource Allocation Policy for Reconfigurable Dragonfly Network in HPC

Proceedings of the 37th International Conference on Supercomputing. Pub Date: 2023-06-21. DOI: 10.1145/3577193.3593732

Guangnan Feng, Dezun Dong, Shizhen Zhao, Yutong Lu

Abstract

Dragonfly is a highly scalable, low-diameter, and cost-efficient network topology that has been adopted in new exascale High Performance Computing (HPC) systems. However, the Dragonfly topology suffers from a limited number of direct links between groups. A reconfigurable network can address this problem by reconfiguring the topology to adjust the number of direct links between groups. While previous works have evaluated the performance improvement of a single job on a reconfigurable HPC network, the performance of whole HPC workloads has not been studied because an appropriate resource allocation policy has been lacking. In this work, we propose the Group-level Resource Allocation Policy (GRAP) to allocate both compute nodes and Reconfigurable Links to jobs in a Reconfigurable Dragonfly Network (RDN). We start by formulating three design principles: a reconfigurable network should be free of reconfiguration interference, guarantee connectivity and performance for each job, and satisfy varied resource requests. Following these principles, GRAP uses different strategies for small and large jobs, and provides three allocation modes for large jobs: Balance Mode, Custom Mode, and Adaptive Mode. Finally, we evaluate GRAP with the CODES network simulation framework and the Slurm Simulator using real workload traces. The results demonstrate that RDN coupled with GRAP achieves lower latency, higher bandwidth, and lower job wait time.
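To make the "limited direct links between groups" problem concrete, the following minimal Python sketch compares the number of direct links per group pair in a static Dragonfly with the number available if the groups assigned to one job point all of their global ports at each other. The parameter names (`a` routers per group, `h` global ports per router, `g` groups) follow the usual Dragonfly notation, and the even split across peer groups is an assumption made purely for illustration; it is not taken from the paper and should not be read as how any GRAP mode actually allocates links.

```python
def static_links_per_pair(a: int, h: int, g: int) -> int:
    """Direct links between any two groups in a static, evenly wired Dragonfly.

    Each group owns a * h global ports, spread evenly over the other g - 1 groups.
    """
    if g < 2:
        raise ValueError("need at least two groups")
    return (a * h) // (g - 1)


def reconfigured_links_per_pair(a: int, h: int, job_groups: int) -> int:
    """Direct links per group pair if a job's groups connect only to each other.

    Assumption (illustrative only): every global port of each group in the job
    is re-pointed at the job's other groups and split evenly among them.
    """
    if job_groups < 2:
        return 0  # a single-group job needs no inter-group links
    return (a * h) // (job_groups - 1)


if __name__ == "__main__":
    a, h = 8, 4        # hypothetical router and port counts
    g = a * h + 1      # maximum-size Dragonfly: 33 groups
    print("static links per pair:", static_links_per_pair(a, h, g))
    print("links per pair for a 5-group job:", reconfigured_links_per_pair(a, h, 5))
```

Under these assumed parameters, a maximum-size static Dragonfly has a single direct link between any two groups, whereas a job confined to five groups could in principle be given several direct links per pair. This is the headroom that a group-level allocation policy has to distribute among concurrent jobs.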
