Designing Hierarchical Multi-HCA Aware Allgather in MPI
Tu Tran, Benjamin Michalowicz, B. Ramesh, H. Subramoni, A. Shafi, D. Panda
Workshop Proceedings of the 51st International Conference on Parallel Processing, August 29, 2022. DOI: 10.1145/3547276.3548524
Abstract
To accelerate communication between nodes, supercomputers are now equipped with multiple network adapters per node, resulting in a "multi-rail" network. The second- and third-placed systems on the Top500 use two adapters per node, and the ThetaGPU system at Argonne National Laboratory (ANL) uses eight adapters per node. With this abundance of networking resources, utilizing all of them is a non-trivial task. The Message Passing Interface (MPI) is a dominant programming model for high-performance computing clusters, yet not all MPI collectives exploit all available resources, a shortfall that becomes more apparent as bandwidth and adapter counts grow in a given cluster. In this work, we take up this task and propose hierarchical, multi-HCA aware Allgather designs; Allgather is a communication-intensive collective widely used in applications such as matrix multiplication and as a building block of other collectives. The proposed designs fully utilize all available network adapters within a node and provide high overlap between inter-node and intra-node communication. At the micro-benchmark level, our new schemes improve performance for both single-node and multi-node communication, with inter-node improvements of up to 62% and 61% over HPC-X and MVAPICH2-X, respectively, for 1024 processes. The design for inter-node communication also boosts the performance of Ring Allreduce by 56% and 44% compared to HPC-X and MVAPICH2-X, respectively. At the application level, the enhanced Allgather yields 1.98x and 1.42x improvements in a matrix-vector multiplication kernel compared to HPC-X and MVAPICH2-X, and Allreduce performs up to 7.83% better in deep learning training compared to MVAPICH2-X.
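
For readers unfamiliar with the hierarchical decomposition the paper builds on, the sketch below illustrates a node-leader Allgather in plain MPI: gather within each node, Allgather among node leaders, then broadcast within each node. This is only an illustrative sketch, not the paper's design; the multi-HCA striping and the overlap between inter-node and intra-node phases described in the paper happen inside the MPI library and are not reproduced here. The helper name `hier_allgather` is hypothetical, and the sketch assumes block process placement (contiguous world ranks per node) and an equal number of processes per node.

```c
/*
 * Minimal sketch of a hierarchical Allgather (illustrative only).
 * Assumptions: block placement (contiguous world ranks per node) and the
 * same number of processes on every node; otherwise the output ordering
 * would not match MPI_Allgather semantics.
 */
#include <mpi.h>
#include <stdlib.h>

/* Hypothetical helper: hierarchical allgather of 'count' ints per process. */
static void hier_allgather(const int *sendbuf, int count, int *recvbuf,
                           MPI_Comm comm)
{
    int world_rank, world_size;
    MPI_Comm_rank(comm, &world_rank);
    MPI_Comm_size(comm, &world_size);

    /* Split into per-node (shared-memory) communicators. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, world_rank,
                        MPI_INFO_NULL, &node_comm);
    int node_rank, node_size;
    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm_size(node_comm, &node_size);

    /* Node leaders (node_rank == 0) form an inter-node communicator. */
    MPI_Comm leader_comm;
    MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED,
                   world_rank, &leader_comm);

    /* Step 1: intra-node gather onto the leader. */
    int *node_buf = NULL;
    if (node_rank == 0)
        node_buf = malloc((size_t)node_size * count * sizeof(int));
    MPI_Gather(sendbuf, count, MPI_INT, node_buf, count, MPI_INT,
               0, node_comm);

    /* Step 2: inter-node Allgather among leaders. */
    if (node_rank == 0) {
        MPI_Allgather(node_buf, node_size * count, MPI_INT,
                      recvbuf, node_size * count, MPI_INT, leader_comm);
        free(node_buf);
        MPI_Comm_free(&leader_comm);
    }

    /* Step 3: intra-node broadcast of the assembled result. */
    MPI_Bcast(recvbuf, world_size * count, MPI_INT, 0, node_comm);
    MPI_Comm_free(&node_comm);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int sendval = rank;
    int *recvbuf = malloc((size_t)size * sizeof(int));
    hier_allgather(&sendval, 1, recvbuf, MPI_COMM_WORLD);

    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```

In this decomposition the inter-node step is the only phase that crosses the network, which is where a multi-rail-aware library can stripe traffic across the available HCAs; the paper's contribution lies in that inter-node stage and in overlapping it with the intra-node steps.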