Designing and Optimizing GPU-aware Nonblocking MPI Neighborhood Collective Communication for PETSc

Kawthar Shafie Khorassani, Chen-Chun Chen, H. Subramoni, D. Panda
{"title":"面向PETSc的gpu感知非阻塞MPI邻域集体通信设计与优化","authors":"Kawthar Shafie Khorassani, Chen-Chun Chen, H. Subramoni, D. Panda","doi":"10.1109/IPDPS54959.2023.00070","DOIUrl":null,"url":null,"abstract":"MPI Neighborhood collectives are used for non-traditional collective operations involving uneven distribution of communication amongst processes such as sparse communication patterns. They provide flexibility to define the communication pattern involved when a neighborhood relationship can be defined. PETSc, the Portable, Extensible Toolkit for Scientific Computation, used extensively with scientific applications to provide scalable solutions through routines modeled by partial differential equations, utilizes neighborhood communication patterns to define various structures and routines.We propose GPU-aware MPI Neighborhood collective operations with support for AMD and NVIDIA GPU backends and propose optimized designs to provide scalable performance for various communication routines. We evaluate our designs using PETSc structures for scattering from a parallel vector to a parallel vector, scattering from a sequential vector to a parallel vector, and scattering from a parallel vector to a sequential vector using a star forest graph representation implemented with nonblocking MPI neighborhood alltoallv collective operations. We evaluate our neighborhood designs on 64 NVIDIA GPUs on the Lassen system with Infiniband networking, demonstrating30.90% improvement against a GPU implementation utilizing CPU-staging techniques, and 8.25% improvement against GPU-aware point-to-point implementations of the communication pattern. We also evaluate on 64 AMD GPUs on the Spock system with slingshot networking and present 39.52% improvement against the CPU-staging implementation of a neighborhood GPU vector type in PETSc, and 33.25% improvement against GPU-aware point-to-point implementation of the routine.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"252 10 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Designing and Optimizing GPU-aware Nonblocking MPI Neighborhood Collective Communication for PETSc*\",\"authors\":\"Kawthar Shafie Khorassani, Chen-Chun Chen, H. Subramoni, D. Panda\",\"doi\":\"10.1109/IPDPS54959.2023.00070\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"MPI Neighborhood collectives are used for non-traditional collective operations involving uneven distribution of communication amongst processes such as sparse communication patterns. They provide flexibility to define the communication pattern involved when a neighborhood relationship can be defined. PETSc, the Portable, Extensible Toolkit for Scientific Computation, used extensively with scientific applications to provide scalable solutions through routines modeled by partial differential equations, utilizes neighborhood communication patterns to define various structures and routines.We propose GPU-aware MPI Neighborhood collective operations with support for AMD and NVIDIA GPU backends and propose optimized designs to provide scalable performance for various communication routines. 
We evaluate our designs using PETSc structures for scattering from a parallel vector to a parallel vector, scattering from a sequential vector to a parallel vector, and scattering from a parallel vector to a sequential vector using a star forest graph representation implemented with nonblocking MPI neighborhood alltoallv collective operations. We evaluate our neighborhood designs on 64 NVIDIA GPUs on the Lassen system with Infiniband networking, demonstrating30.90% improvement against a GPU implementation utilizing CPU-staging techniques, and 8.25% improvement against GPU-aware point-to-point implementations of the communication pattern. We also evaluate on 64 AMD GPUs on the Spock system with slingshot networking and present 39.52% improvement against the CPU-staging implementation of a neighborhood GPU vector type in PETSc, and 33.25% improvement against GPU-aware point-to-point implementation of the routine.\",\"PeriodicalId\":343684,\"journal\":{\"name\":\"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)\",\"volume\":\"252 10 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-05-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/IPDPS54959.2023.00070\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPS54959.2023.00070","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0

Abstract

MPI neighborhood collectives support non-traditional collective operations in which communication is unevenly distributed among processes, such as sparse communication patterns. They provide the flexibility to define the communication pattern involved whenever a neighborhood relationship among processes can be defined. PETSc, the Portable, Extensible Toolkit for Scientific Computation, is used extensively in scientific applications to provide scalable solutions to problems modeled by partial differential equations, and it relies on neighborhood communication patterns to define various structures and routines. We propose GPU-aware MPI neighborhood collective operations with support for AMD and NVIDIA GPU backends, along with optimized designs that provide scalable performance for various communication routines.

We evaluate our designs using PETSc structures for scattering from a parallel vector to a parallel vector, from a sequential vector to a parallel vector, and from a parallel vector to a sequential vector, using a star-forest graph representation implemented with nonblocking MPI neighborhood alltoallv collective operations. On 64 NVIDIA GPUs on the Lassen system with InfiniBand networking, our neighborhood designs achieve a 30.90% improvement over a GPU implementation that stages data through the CPU, and an 8.25% improvement over a GPU-aware point-to-point implementation of the same communication pattern. On 64 AMD GPUs on the Spock system with Slingshot networking, they achieve a 39.52% improvement over the CPU-staging implementation of a neighborhood GPU vector type in PETSc, and a 33.25% improvement over a GPU-aware point-to-point implementation of the routine.
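The core primitive behind these designs is MPI's nonblocking neighborhood alltoallv over a distributed-graph communicator. The following minimal C sketch (illustrative only, not the paper's implementation) shows that pattern with GPU-resident buffers for a simple ring neighborhood; it assumes a GPU-aware (e.g., CUDA-aware) MPI build, so device pointers can be passed directly to MPI calls.

    /* Sketch: nonblocking neighborhood alltoallv with GPU buffers.
     * Assumes a GPU-aware MPI build; otherwise, data must be staged
     * through host memory before and after the call. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Neighborhood for this sketch: left and right neighbors on a ring. */
        int left  = (rank - 1 + size) % size;
        int right = (rank + 1) % size;
        int sources[2]      = {left, right};
        int destinations[2] = {left, right};

        MPI_Comm ring;
        MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD,
                                       2, sources, MPI_UNWEIGHTED,
                                       2, destinations, MPI_UNWEIGHTED,
                                       MPI_INFO_NULL, 0 /* no reorder */, &ring);

        /* One double per neighbor, resident in device memory. */
        const int n = 1;
        int sendcounts[2] = {n, n}, recvcounts[2] = {n, n};
        int sdispls[2]    = {0, n}, rdispls[2]    = {0, n};

        double *d_send, *d_recv;
        cudaMalloc((void **)&d_send, 2 * n * sizeof(double));
        cudaMalloc((void **)&d_recv, 2 * n * sizeof(double));
        /* (fill d_send with a kernel or cudaMemcpy before communicating) */

        MPI_Request req;
        MPI_Ineighbor_alltoallv(d_send, sendcounts, sdispls, MPI_DOUBLE,
                                d_recv, recvcounts, rdispls, MPI_DOUBLE,
                                ring, &req);
        /* ...independent GPU work can overlap with communication here... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);

        cudaFree(d_send);
        cudaFree(d_recv);
        MPI_Comm_free(&ring);
        MPI_Finalize();
        return 0;
    }

The nonblocking form is what lets communication overlap with computation, which is the property the scatter routines above exploit.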
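On the PETSc side, these scatter patterns are exposed through the VecScatter interface, which modern PETSc implements on top of its star-forest (PetscSF) abstraction. Below is a minimal hypothetical sketch of the parallel-to-sequential case, gathering a distributed vector onto rank 0. It assumes PETSc 3.18 or later for the PetscCall macro; with a CUDA-enabled PETSc build, running with -vec_type cuda makes the vector GPU-resident, so the underlying scatter can move device data directly when the MPI library is GPU-aware.

    /* Sketch: scatter a parallel PETSc vector to a sequential vector on
     * rank 0, one of the communication routines evaluated above. */
    #include <petscvec.h>

    int main(int argc, char **argv)
    {
        Vec         x, x0;   /* parallel source, sequential destination */
        VecScatter  ctx;
        PetscMPIInt rank;

        PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
        PetscCallMPI(MPI_Comm_rank(PETSC_COMM_WORLD, &rank));

        PetscCall(VecCreate(PETSC_COMM_WORLD, &x));
        PetscCall(VecSetSizes(x, PETSC_DECIDE, 64));
        PetscCall(VecSetFromOptions(x));   /* honors -vec_type cuda */
        PetscCall(VecSet(x, (PetscScalar)rank));

        /* Build the scatter context and the rank-0 sequential vector. */
        PetscCall(VecScatterCreateToZero(x, &ctx, &x0));
        PetscCall(VecScatterBegin(ctx, x, x0, INSERT_VALUES, SCATTER_FORWARD));
        PetscCall(VecScatterEnd(ctx, x, x0, INSERT_VALUES, SCATTER_FORWARD));

        PetscCall(VecScatterDestroy(&ctx));
        PetscCall(VecDestroy(&x0));
        PetscCall(VecDestroy(&x));
        PetscCall(PetscFinalize());
        return 0;
    }

The Begin/End split mirrors the nonblocking MPI collective underneath: communication is posted in VecScatterBegin and completed in VecScatterEnd, leaving room for overlapping work in between.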