Design and Evaluation of Shared Memory Communication Benchmarks on Emerging Architectures using MVAPICH2

Shulei Xu, J. Hashmi, S. Chakraborty, H. Subramoni, D. Panda
{"title":"Design and Evaluation of Shared Memory CommunicationBenchmarks on Emerging Architectures using MVAPICH2","authors":"Shulei Xu, J. Hashmi, S. Chakraborty, H. Subramoni, D. Panda","doi":"10.1109/IPDRM49579.2019.00010","DOIUrl":null,"url":null,"abstract":"Recent advances in processor technologies have led to highly multi-threaded and dense multi- and many-core HPC systems. The adoption of such dense multi-core processors is widespread in the Top500 systems. Message Passing Interface (MPI) has been widely used to scale out scientific applications. The communication designs for intra-node communication in MPI are mainly based on shared memory communication. The increased core-density of modern processors warrants the use of efficient shared memory communication designs to achieve optimal performance. While there have been various algorithms and data-structures proposed for the producer-consumer like scenarios in the literature, there is a need to revisit them in the context of MPI communication on modern architectures to find the optimal solutions that work best for modern architectures. In this paper, we first propose a set of low-level benchmarks to evaluate various data-structures such as Lamport queues, Fast-Forward queues, and Fastboxes (FB) for shared memory communication. Then, we bring these designs into the MVAPICH2 MPI library and measure their impact on the MPI intra-node communication for a wide variety of communication patterns. 
The benchmarking results are carried out on modern multi-/many-core architectures including Intel Xeon CascadeLake and Intel Knights Landing.","PeriodicalId":256149,"journal":{"name":"2019 IEEE/ACM Third Annual Workshop on Emerging Parallel and Distributed Runtime Systems and Middleware (IPDRM)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE/ACM Third Annual Workshop on Emerging Parallel and Distributed Runtime Systems and Middleware (IPDRM)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDRM49579.2019.00010","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Recent advances in processor technologies have led to highly multi-threaded and dense multi- and many-core HPC systems. The adoption of such dense multi-core processors is widespread among Top500 systems. The Message Passing Interface (MPI) has been widely used to scale out scientific applications, and its intra-node communication designs are primarily based on shared memory. The increased core density of modern processors warrants efficient shared memory communication designs to achieve optimal performance. While various algorithms and data structures have been proposed in the literature for producer-consumer scenarios, there is a need to revisit them in the context of MPI communication to find the solutions that work best on modern architectures. In this paper, we first propose a set of low-level benchmarks to evaluate various data structures for shared memory communication, such as Lamport queues, Fast-Forward queues, and Fastboxes (FB). Then, we bring these designs into the MVAPICH2 MPI library and measure their impact on MPI intra-node communication for a wide variety of communication patterns. The benchmarks are carried out on modern multi-/many-core architectures including Intel Xeon Cascade Lake and Intel Knights Landing.