Designing and Optimizing GPU-aware Nonblocking MPI Neighborhood Collective Communication for PETSc

Kawthar Shafie Khorassani, Chen-Chun Chen, H. Subramoni, D. Panda

2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2023. DOI: 10.1109/IPDPS54959.2023.00070
MPI neighborhood collectives support non-traditional collective operations in which communication is distributed unevenly among processes, such as sparse communication patterns. They provide the flexibility to define the communication pattern involved whenever a neighborhood relationship can be defined. PETSc, the Portable, Extensible Toolkit for Scientific Computation, is used extensively in scientific applications to provide scalable solutions through routines modeled by partial differential equations, and it relies on neighborhood communication patterns to define various structures and routines. We propose GPU-aware MPI neighborhood collective operations with support for AMD and NVIDIA GPU backends, along with optimized designs that provide scalable performance for various communication routines. We evaluate our designs using PETSc structures for scattering from a parallel vector to a parallel vector, from a sequential vector to a parallel vector, and from a parallel vector to a sequential vector, using a star forest graph representation implemented with nonblocking MPI neighborhood alltoallv collective operations. We evaluate our neighborhood designs on 64 NVIDIA GPUs on the Lassen system with InfiniBand networking, demonstrating a 30.90% improvement over a GPU implementation that uses CPU-staging techniques and an 8.25% improvement over GPU-aware point-to-point implementations of the communication pattern. We also evaluate on 64 AMD GPUs on the Spock system with Slingshot networking and present a 39.52% improvement over the CPU-staging implementation of a neighborhood GPU vector type in PETSc and a 33.25% improvement over a GPU-aware point-to-point implementation of the routine.
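As context for the communication pattern described above, the following is a minimal, hypothetical sketch of a nonblocking GPU-aware MPI neighborhood alltoallv over a distributed-graph communicator. It assumes a GPU-aware MPI build that accepts CUDA device pointers directly and uses a simple periodic-ring neighborhood for illustration; it is not the paper's implementation or PETSc's star forest code.

/* Minimal sketch (assumption: MPI library is GPU-aware and accepts
 * cudaMalloc'd buffers).  Each rank exchanges one double with its left
 * and right neighbors in a periodic ring via MPI_Ineighbor_alltoallv. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Two neighbors per rank (left and right) in a periodic ring. */
    int nbrs[2] = { (rank - 1 + size) % size, (rank + 1) % size };
    MPI_Comm nbr_comm;
    MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD,
                                   2, nbrs, MPI_UNWEIGHTED,   /* sources      */
                                   2, nbrs, MPI_UNWEIGHTED,   /* destinations */
                                   MPI_INFO_NULL, 0, &nbr_comm);

    /* One double to/from each neighbor; buffers live in GPU memory. */
    int counts[2] = { 1, 1 };
    int displs[2] = { 0, 1 };
    double *d_send, *d_recv;
    cudaMalloc((void **)&d_send, 2 * sizeof(double));
    cudaMalloc((void **)&d_recv, 2 * sizeof(double));

    double h_send[2] = { (double)rank, (double)rank };
    cudaMemcpy(d_send, h_send, 2 * sizeof(double), cudaMemcpyHostToDevice);

    /* Nonblocking neighborhood collective: device pointers are passed
     * straight to MPI, avoiding any explicit CPU staging. */
    MPI_Request req;
    MPI_Ineighbor_alltoallv(d_send, counts, displs, MPI_DOUBLE,
                            d_recv, counts, displs, MPI_DOUBLE,
                            nbr_comm, &req);

    /* ... independent GPU work could be overlapped here ... */

    MPI_Wait(&req, MPI_STATUS_IGNORE);

    double h_recv[2];
    cudaMemcpy(h_recv, d_recv, 2 * sizeof(double), cudaMemcpyDeviceToHost);
    printf("rank %d received %g (left) and %g (right)\n",
           rank, h_recv[0], h_recv[1]);

    cudaFree(d_send);
    cudaFree(d_recv);
    MPI_Comm_free(&nbr_comm);
    MPI_Finalize();
    return 0;
}

The same call pattern applies on AMD GPUs by allocating the buffers with hipMalloc instead, provided the MPI library is built with ROCm support.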