Machine-agnostic and Communication-aware Designs for MPI on Emerging Architectures

2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2020-05-01 DOI:10.1109/IPDPS47924.2020.00014

J. Hashmi, Shulei Xu, B. Ramesh, Mohammadreza Bayatpour, H. Subramoni, D. Panda

Pages: 32-41

Citations: 2

Abstract

Modern multi-/many-core processors offer higher core density, hardware multi-threading, deeper memory hierarchies, and diverse architectural capabilities. While emerging cloud-based HPC systems can deliver near-native performance, they add further diversity to the architectures. The Message Passing Interface (MPI) offers the flexibility to bind application processes to CPU cores arbitrarily; however, these binding policies are static and typically take neither the applications’ communication patterns nor the underlying machine architecture into consideration. This lack of association between the dynamic nature of applications and the architectural diversity of modern processors makes it difficult for application developers and MPI designers to exploit modern multi-/many-core systems to their full potential. In this paper, we propose a set of low-level benchmarking-based approaches and MPI-level designs to infer vendor-specific machine characteristics (e.g., physical-to-virtual machine topologies) and the dynamic communication patterns of applications. Using this information, we propose two novel algorithms to construct efficient MPI mappings for any given architecture and application communication pattern. The proposed designs are implemented in the MVAPICH2 MPI library and evaluated on three different architectures using various micro-benchmarks and application kernels. We demonstrate up to 2X performance improvement for MPI collectives, and up to 3.5X and 26% improvement for the NAS-CG and miniAMR application kernels, respectively.
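The abstract does not describe the two mapping algorithms in detail, so the following is only a minimal sketch of the general idea of communication-aware mapping, not the paper's method. It assumes a hypothetical traffic matrix `comm` (bytes exchanged between rank pairs) and a hypothetical core distance matrix `dist` (relative cost between cores, as low-level benchmarks might estimate it), and uses a simple greedy heuristic: place the rank that talks most to already-placed ranks on the free core that is closest to them.

```python
def mapping_cost(comm, dist, mapping):
    """Total traffic-weighted distance; mapping[rank] is the core of that rank."""
    n = len(comm)
    return sum(comm[i][j] * dist[mapping[i]][mapping[j]]
               for i in range(n) for j in range(i + 1, n))


def greedy_map(comm, dist):
    """Greedy rank-to-core placement (illustrative heuristic, not the paper's)."""
    n = len(comm)
    placed = {}                      # rank -> core
    free_cores = list(range(n))
    unplaced = list(range(n))
    while unplaced:
        if placed:
            # Next rank: the one with the most traffic to ranks already placed.
            rank = max(unplaced, key=lambda r: sum(comm[r][p] for p in placed))
        else:
            # First rank: the one with the most traffic overall.
            rank = max(unplaced, key=lambda r: sum(comm[r]))
        # Pick the free core with the lowest weighted distance to placed ranks.
        core = min(free_cores,
                   key=lambda c: sum(comm[rank][p] * dist[c][placed[p]]
                                     for p in placed))
        placed[rank] = core
        free_cores.remove(core)
        unplaced.remove(rank)
    return [placed[r] for r in range(n)]


# Hypothetical example: 4 ranks; ranks 0<->2 and 1<->3 exchange most data.
comm = [[0, 1, 100, 1],
        [1, 0, 1, 100],
        [100, 1, 0, 1],
        [1, 100, 1, 0]]
# Hypothetical machine: cores 0,1 share a socket, cores 2,3 share another.
dist = [[0, 1, 4, 4],
        [1, 0, 4, 4],
        [4, 4, 0, 1],
        [4, 4, 1, 0]]

mapping = greedy_map(comm, dist)
print(mapping)                                    # [0, 2, 1, 3]
print(mapping_cost(comm, dist, [0, 1, 2, 3]))     # 810 (default in-order binding)
print(mapping_cost(comm, dist, mapping))          # 216
```

In this toy case the greedy placement puts each heavily communicating pair on the same socket, which lowers the weighted cost compared with the in-order binding. A production design would also have to account for caches, NUMA domains, hardware threads, and collective communication patterns, none of which this sketch models.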


