Topology-Aware Performance Optimization and Modeling of Adaptive Mesh Refinement Codes for Exascale
Cy P. Chan, J. Bachan, J. Kenny, Jeremiah J. Wilke, V. Beckner, A. Almgren, J. Bell
DOI: 10.1109/COM-HPC.2016.8
We introduce a topology-aware performance optimization and modeling workflow for AMR simulation that includes two new modeling tools, ProgrAMR and Mota Mapper, which interface with the BoxLib AMR framework and the SSTmacro network simulator. ProgrAMR allows us to generate and model the execution of task dependency graphs from high-level specifications of AMR-based applications, which we demonstrate by analyzing two example AMR-based multigrid solvers with varying degrees of asynchrony. Mota Mapper generates multiobjective, network topology-aware box mappings, which we apply to optimize the data layout for the example multigrid solvers. While the sensitivity of these solvers to layout and execution strategy appears to be modest for balanced scenarios, the impact of better mapping algorithms can be significant when performance is highly constrained by network hop latency. Furthermore, we show that network latency in the multigrid bottom solve is the main contributing factor preventing good scaling on exascale-class machines.
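The abstract does not spell out Mota Mapper's objective functions. As a rough, hedged illustration of what a topology-aware box mapping might optimize, the C sketch below scores a candidate box-to-node assignment by weighting each box pair's ghost-exchange volume by the network hop distance between their assigned nodes; a multiobjective mapper would trade such a term against load balance. The BoxEdge structure and the hop_distance query are hypothetical placeholders, not BoxLib or SSTmacro interfaces.

```c
#include <stddef.h>

/* Hypothetical description of an AMR box-graph edge: two boxes that
 * exchange ghost-cell data, and the number of bytes exchanged. */
typedef struct {
    int    box_a;
    int    box_b;
    size_t bytes;
} BoxEdge;

/* Hypothetical topology query: hops between two compute nodes.
 * On a real system this would come from the machine's topology API. */
extern int hop_distance(int node_a, int node_b);

/* Score a candidate mapping (box index -> node index) by summing
 * communication volume weighted by hop distance.  A topology-aware
 * mapper would try to minimize this alongside a load-balance term. */
double mapping_hop_cost(const int *box_to_node,
                        const BoxEdge *edges, size_t num_edges)
{
    double cost = 0.0;
    for (size_t i = 0; i < num_edges; i++) {
        int na = box_to_node[edges[i].box_a];
        int nb = box_to_node[edges[i].box_b];
        cost += (double)edges[i].bytes * (double)hop_distance(na, nb);
    }
    return cost;
}
```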
{"title":"Topology-Aware Performance Optimization and Modeling of Adaptive Mesh Refinement Codes for Exascale","authors":"Cy P. Chan, J. Bachan, J. Kenny, Jeremiah J. Wilke, V. Beckner, A. Almgren, J. Bell","doi":"10.1109/COM-HPC.2016.8","DOIUrl":"https://doi.org/10.1109/COM-HPC.2016.8","url":null,"abstract":"We introduce a topology-aware performance optimization and modeling workflow for AMR simulation that includes two new modeling tools, ProgrAMR and Mota Mapper, which interface with the BoxLib AMR framework and the SSTmacro network simulator. ProgrAMR allows us to generate and model the execution of task dependency graphs from high-level specifications of AMR-based applications, which we demonstrate by analyzing two example AMR-based multigrid solvers with varying degrees of asynchrony. Mota Mapper generates multiobjective, network topology-aware box mappings, which we apply to optimize the data layout for the example multigrid solvers. While the sensitivity of these solvers to layout and execution strategy appears to be modest for balanced scenarios, the impact of better mapping algorithms can be significant when performance is highly constrained by network hop latency. Furthermore, we show that network latency in the multigrid bottom solve is the main contributing factor preventing good scaling on exascale-class machines.","PeriodicalId":332852,"journal":{"name":"2016 First International Workshop on Communication Optimizations in HPC (COMHPC)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129090679","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Scalable Hierarchical Aggregation Protocol (SHArP): A Hardware Architecture for Efficient Data Reduction
R. Graham, Devendar Bureddy, Pak Lui, H. Rosenstock, G. Shainer, Gil Bloch, Dror Goldenberg, Mike Dubman, Sasha Kotchubievsky, Vladimir Koushnir, Lion Levi, Alexander Margolin, Tamir Ronen, Alexander Shpiner, O. Wertheim, E. Zahavi
DOI: 10.1109/COM-HPC.2016.6
Increased system size and a greater reliance on system parallelism to meet computational needs require innovative system architectures to address the simulation challenges. As a step towards a new class of network co-processors (intelligent network devices that manipulate data traversing the data-center network), this paper describes the SHArP technology, designed to offload collective operation processing to the network. It is implemented in Mellanox's SwitchIB-2 ASIC, using in-network trees to reduce data from a group of sources and to distribute the result. Multiple parallel jobs with several partially overlapping groups are supported, each with several reduction operations in flight. Large performance enhancements are obtained: an eight-byte MPI_Allreduce() operation on 128 hosts improves by a factor of 2.1, from 6.01 to 2.83 microseconds, and pipelining improves the latency of a 4096-byte MPI_Allreduce() operation by a factor of 3.24, from 46.93 to 14.48 microseconds.
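For context, the benchmarked operation is a standard MPI_Allreduce (6.01 µs / 2.83 µs ≈ 2.1×). The minimal C sketch below issues the eight-byte (single double) reduction measured above; whether the reduction is actually offloaded to the switches via SHArP is decided by the MPI library and fabric configuration, not by the application code.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Eight-byte payload: a single double per rank. */
    double local = (double)rank;
    double global_sum = 0.0;

    /* The collective whose latency is reported in the abstract.  With a
     * SHArP-capable fabric and MPI, the reduction can be performed in
     * the switches rather than on the hosts. */
    MPI_Allreduce(&local, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of ranks = %f\n", global_sum);

    MPI_Finalize();
    return 0;
}
```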
{"title":"Scalable Hierarchical Aggregation Protocol (SHArP): A Hardware Architecture for Efficient Data Reduction","authors":"R. Graham, Devendar Bureddy, Pak Lui, H. Rosenstock, G. Shainer, Gil Bloch, Dror Goldenberg, Mike Dubman, Sasha Kotchubievsky, Vladimir Koushnir, Lion Levi, Alexander Margolin, Tamir Ronen, Alexander Shpiner, O. Wertheim, E. Zahavi","doi":"10.1109/COM-HPC.2016.6","DOIUrl":"https://doi.org/10.1109/COM-HPC.2016.6","url":null,"abstract":"Increased system size and a greater reliance on utilizing system parallelism to achieve computational needs, requires innovative system architectures to meet the simulation challenges. As a step towards a new network class of co-processors — intelligent network devices, which manipulate data traversing the data-center network, this paper describes the SHArP technology designed to offload collective operation processing to the network. This is implemented in Mellanox's SwitchIB-2 ASIC, using innetwork trees to reduce data from a group of sources, and to distribute the result. Multiple parallel jobs with several partially overlapping groups are supported each with several reduction operations in-flight. Large performance enhancements are obtained, with an improvement of a factor of 2.1 for an eight byte MPI_Allreduce() operation on 128 hosts, going from 6.01 to 2.83 microseconds. Pipelining is used for an improvement of a factor of 3.24 in the latency of a 4096 byte MPI_Allreduce() operations, declining from 46.93 to 14.48 microseconds.","PeriodicalId":332852,"journal":{"name":"2016 First International Workshop on Communication Optimizations in HPC (COMHPC)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120823754","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Network Topologies and Inevitable Contention
Grey Ballard, J. Demmel, A. Gearhart, Benjamin Lipshitz, Yishai Oltchik, O. Schwartz, Sivan Toledo
DOI: 10.1109/COM-HPC.2016.10
Network topologies can have a significant effect on the execution costs of parallel algorithms due to inter-processor communication. For particular combinations of computations and network topologies, costly network contention may inevitably become a bottleneck, even if algorithms are optimally designed so that each processor communicates as little as possible. We obtain novel contention lower bounds that are functions of both the network parameters and the computation graph parameters. For several combinations of fundamental computations and common network topologies, our new analysis improves upon previous per-processor lower bounds, which only specify the number of words communicated by the busiest individual processor. We consider torus and mesh topologies, universal fat-trees, and hypercubes; the algorithms covered include classical matrix multiplication and direct numerical linear algebra, fast matrix multiplication algorithms, programs that reference arrays, N-body computations, and the FFT. For example, we show that fast matrix multiplication algorithms (e.g., Strassen's) running on a 3D torus will suffer from contention bottlenecks, whereas this network is likely sufficient for a classical matrix multiplication algorithm. Our new lower bounds are matched by existing algorithms in only very few cases, leaving many open problems for network and algorithmic design.
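The paper's new contention bounds are not reproduced here. As a reminder of the kind of per-processor bound they refine, the classical memory-dependent result for dense n-by-n matrix multiplication (Irony, Toledo, and Tiskin) states that with P processors, each holding M words of local memory, the busiest processor must communicate on the order of n³/(P√M) words:

```latex
% Classical per-processor bandwidth lower bound for conventional
% (non-Strassen) n x n matrix multiplication on P processors,
% each with M words of local memory:
\[
  W_{\text{per-proc}} \;=\; \Omega\!\left(\frac{n^{3}}{P\sqrt{M}}\right)
  \quad \text{words moved by the busiest processor.}
\]
% Contention bounds of the kind developed in the paper additionally
% account for how this traffic shares the links crossing a network cut,
% so a low-bisection topology (e.g., a 3D torus) can force a larger
% effective lower bound than the per-processor argument alone.
```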
{"title":"Network Topologies and Inevitable Contention","authors":"Grey Ballard, J. Demmel, A. Gearhart, Benjamin Lipshitz, Yishai Oltchik, O. Schwartz, Sivan Toledo","doi":"10.1109/COM-HPC.2016.10","DOIUrl":"https://doi.org/10.1109/COM-HPC.2016.10","url":null,"abstract":"Network topologies can have significant effect on the execution costs of parallel algorithms due to inter-processor communication. For particular combinations of computations and network topologies, costly network contention may inevitably become a bottleneck, even if algorithms are optimally designed so that each processor communicates as little as possible. We obtain novel contention lower bounds that are functions of the network and the computation graph parameters. For several combinations of fundamental computations and common network topologies, our new analysis improves upon previous per-processor lower bounds which only specify the number of words communicated by the busiest individual processor. We consider torus and mesh topologies, universal fat-trees, and hypercubes; algorithms covered include classical matrix multiplication and direct numerical linear algebra, fast matrix multiplication algorithms, programs that reference arrays, N-body computations, and the FFT. For example, we show that fast matrix multiplication algorithms (e.g., Strassen's) running on a 3D torus will suffer from contention bottlenecks. On the other hand, this network is likely sufficient for a classical matrix multiplication algorithm. Our new lower bounds are matched by existing algorithms only in very few cases, leaving many open problems for network and algorithmic design.","PeriodicalId":332852,"journal":{"name":"2016 First International Workshop on Communication Optimizations in HPC (COMHPC)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127557791","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Topology-Aware Data Aggregation for Intensive I/O on Large-Scale Supercomputers
François Tessier, Preeti Malakar, V. Vishwanath, E. Jeannot, Florin Isaila
DOI: 10.1109/COM-HPC.2016.13
Reading and writing data efficiently from storage systems is critical for high-performance, data-centric applications. These I/O systems are increasingly characterized by complex topologies and deeper memory hierarchies, and effective parallel I/O solutions are needed to scale applications on current and future supercomputers. Data aggregation is an efficient approach in which selected processes are placed in charge of aggregating data from a set of neighbors and writing the aggregated data to storage, improving bandwidth utilization while reducing contention. In this work, we take the network topology into account when mapping aggregators, and we propose an optimized buffering system to reduce the aggregation cost. We validate our approach using micro-benchmarks and the I/O kernel of a large-scale cosmology simulation, showing I/O operations up to 15× faster than a standard MPI I/O implementation.
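The paper's aggregator-placement strategy and buffering system are not reproduced here; the C/MPI sketch below only illustrates the general aggregation pattern it builds on: ranks are grouped, one rank per group gathers its neighbors' contributions, and only the aggregators touch the file. The chunk size, group size, and the simple "lowest rank in the group" aggregator choice are assumptions made for illustration; a topology-aware mapper would instead pick aggregators to minimize network distance to the group and to the storage targets.

```c
#include <mpi.h>
#include <stdlib.h>

#define CHUNK 1024            /* doubles contributed by each rank (assumed) */
#define GROUP_SIZE 8          /* ranks per aggregator (assumed)             */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Each rank produces CHUNK doubles of output. */
    double *local = malloc(CHUNK * sizeof(double));
    for (int i = 0; i < CHUNK; i++) local[i] = rank + i * 1e-6;

    /* Phase 1: split ranks into aggregation groups.  Here the split is by
     * contiguous rank blocks; a topology-aware version would group ranks
     * that are close in the network. */
    int group = rank / GROUP_SIZE;
    MPI_Comm group_comm;
    MPI_Comm_split(MPI_COMM_WORLD, group, rank, &group_comm);

    int grank, gsize;
    MPI_Comm_rank(group_comm, &grank);
    MPI_Comm_size(group_comm, &gsize);

    /* The aggregator is group-local rank 0 (an assumption; a topology-aware
     * scheme would choose it from the network layout). */
    double *agg_buf = NULL;
    if (grank == 0) agg_buf = malloc((size_t)gsize * CHUNK * sizeof(double));

    MPI_Gather(local, CHUNK, MPI_DOUBLE,
               agg_buf, CHUNK, MPI_DOUBLE, 0, group_comm);

    /* Phase 2: only aggregators perform file I/O, each writing its group's
     * contiguous block at the corresponding file offset. */
    if (grank == 0) {
        MPI_File fh;
        MPI_File_open(MPI_COMM_SELF, "output.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        MPI_Offset offset =
            (MPI_Offset)group * GROUP_SIZE * CHUNK * sizeof(double);
        MPI_File_write_at(fh, offset, agg_buf, gsize * CHUNK,
                          MPI_DOUBLE, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);
        free(agg_buf);
    }

    free(local);
    MPI_Comm_free(&group_comm);
    MPI_Finalize();
    return 0;
}
```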
{"title":"Topology-Aware Data Aggregation for Intensive I/O on Large-Scale Supercomputers","authors":"François Tessier, Preeti Malakar, V. Vishwanath, E. Jeannot, Florin Isaila","doi":"10.1109/COM-HPC.2016.13","DOIUrl":"https://doi.org/10.1109/COM-HPC.2016.13","url":null,"abstract":"Reading and writing data efficiently from storage systems is critical for high performance data-centric applications. These I/O systems are being increasingly characterized by complex topologies and deeper memory hierarchies. Effective parallel I/O solutions are needed to scale applications on current and future supercomputers. Data aggregation is an efficient approach consisting of electing some processes in charge of aggregating data from a set of neighbors and writing the aggregated data into storage. Thus, the bandwidth use can be optimized while the contention is reduced. In this work, we take into account the network topology for mapping aggregators and we propose an optimized buffering system in order to reduce the aggregation cost. We validate our approach using micro-benchmarks and the I/O kernel of a large-scale cosmology simulation. We show improvements up to 15× faster for I/O operations compared to a standard implementation of MPI I/O.","PeriodicalId":332852,"journal":{"name":"2016 First International Workshop on Communication Optimizations in HPC (COMHPC)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127499094","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Topology and Affinity Aware Hierarchical and Distributed Load-Balancing in Charm++
E. Jeannot, Guillaume Mercier, François Tessier
DOI: 10.1109/COM-HPC.2016.12
The evolution of massively parallel supercomputers makes two issues in particular palpable: load imbalance and poor management of data locality in applications. As the number of cores grows and the amount of memory per core shrinks drastically, meeting large performance requirements demands careful attention to load balancing and, as much as possible, to data locality. One means of accounting for locality is the placement of the processing entities, and load-balancing techniques are relevant for improving application performance. With large-scale platforms in mind, we developed a hierarchical and distributed algorithm whose aim is to perform topology-aware load balancing tailored for Charm++ applications. The algorithm relies on LibTopoMap for network awareness and on TREEMATCH to determine a relevant placement of the processing entities. We show that the proposed algorithm improves the overall execution time for both real applications and a synthetic benchmark; for the latter, we demonstrate scalability up to one million processing entities.
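Neither the Charm++ load-balancer interface nor the LibTopoMap/TREEMATCH calls are reproduced here; the C sketch below only illustrates the hierarchical idea the abstract describes: balance load inside topology-derived groups first, so most migrations stay local, and treat whole groups as entities for a coarser second pass. The data structures and the greedy policy are illustrative assumptions.

```c
#include <stddef.h>

/* Hypothetical description of one migratable task. */
typedef struct {
    int    id;
    double load;   /* measured cost of the task               */
    int    owner;  /* processing entity currently hosting it  */
} Task;

/* Greedy donation: move tasks from the most loaded to the least loaded
 * entity of one group until the imbalance drops below `tolerance`.
 * `load[i]` is the aggregate load of entity i; `entities[]` lists the
 * group's members.  Purely illustrative. */
void balance_group(double *load, const int *entities, int n,
                   Task *tasks, size_t ntasks, double tolerance)
{
    for (;;) {
        int max_e = entities[0], min_e = entities[0];
        for (int i = 1; i < n; i++) {
            if (load[entities[i]] > load[max_e]) max_e = entities[i];
            if (load[entities[i]] < load[min_e]) min_e = entities[i];
        }
        double gap = load[max_e] - load[min_e];
        if (gap <= tolerance) break;

        /* Find a task on the overloaded entity whose migration strictly
         * reduces the gap (0 < task load < gap).  An affinity-aware
         * balancer would also weigh the communication graph here. */
        size_t pick = ntasks;
        for (size_t t = 0; t < ntasks; t++)
            if (tasks[t].owner == max_e &&
                tasks[t].load > 0.0 && tasks[t].load < gap) { pick = t; break; }
        if (pick == ntasks) break;          /* no useful move left */

        tasks[pick].owner = min_e;
        load[max_e] -= tasks[pick].load;
        load[min_e] += tasks[pick].load;
    }
}
```

In a hierarchical scheme of the kind described above, such balancing would run independently inside each topology-derived group, with a second, coarser pass balancing aggregate load across the groups themselves.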
{"title":"Topology and Affinity Aware Hierarchical and Distributed Load-Balancing in Charm++","authors":"E. Jeannot, Guillaume Mercier, François Tessier","doi":"10.1109/COM-HPC.2016.12","DOIUrl":"https://doi.org/10.1109/COM-HPC.2016.12","url":null,"abstract":"The evolution of massively parallel supercomputers make palpable two issues in particular: the load imbalance and the poor management of data locality in applications. Thus, with the increase of the number of cores and the drastic decrease of amount of memory per core, the large performance needs imply to particularly take care of the load-balancing and as much as possible of the locality of data. One mean to take into account this locality issue relies on the placement of the processing entities and load balancing techniques are relevant in order to improve application performance. With large-scale platforms in mind, we developed a hierarchical and distributed algorithm which aim is to perform a topology-aware load balancing tailored for Charm++ applications. This algorithm is based on both LibTopoMap for the network awareness aspects and on TREEMATCH to determine a relevant placement of the processing entities. We show that the proposed algorithm improves the overall execution time in both the cases of real applications and a synthetic benchmark as well. For this last experiment, we show a scalability up to one millions processing entities.","PeriodicalId":332852,"journal":{"name":"2016 First International Workshop on Communication Optimizations in HPC (COMHPC)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117160004","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient Reliability Support for Hardware Multicast-Based Broadcast in GPU-enabled Streaming Applications
Ching-Hsiang Chu, Khaled Hamidouche, H. Subramoni, Akshay Venkatesh, B. Elton, D. Panda
DOI: 10.1109/COM-HPC.2016.9
Streaming applications, which are data-intensive, are extensively run on High-Performance Computing (HPC) systems in pursuit of higher performance and scalability. These applications typically use broadcast operations to disseminate data in real time from a single source to multiple workers, each being a multi-GPU computing site. State-of-the-art broadcast operations take advantage of InfiniBand (IB) hardware multicast (MCAST) and NVIDIA GPUDirect features to boost inter-node communication performance and scalability. However, the IB MCAST feature works only with the IB Unreliable Datagram (UD) transport and consequently provides unreliable communication; higher-level libraries and/or runtime environments must handle and provide reliability explicitly, which can become a performance bottleneck for streaming applications. In this paper, we analyze the specific requirements of streaming applications and the performance bottlenecks involved in handling reliability. We show that the traditional Negative-Acknowledgement (NACK) based approach requires the broadcast sender to perform retransmissions for lost packets, degrading streaming throughput. To alleviate this issue, we propose a novel Remote Memory Access (RMA) based scheme that provides high-performance reliability support at the MPI level: the receivers themselves (as opposed to the sender) retrieve lost packets through RMA operations. Furthermore, we provide an analytical model of the memory requirements of the proposed RMA-based scheme. Our experimental results show that the proposed scheme introduces nearly no overhead compared to existing solutions. In a micro-benchmark with injected failures (to simulate unreliable network environments), the proposed scheme shows up to a 45% reduction in latency compared to the existing NACK-based scheme. Moreover, with a synthetic streaming benchmark, our design also achieves up to a 56% higher broadcast rate than the traditional NACK-based scheme on a GPU-dense Cray CS-Storm system with up to 88 NVIDIA K80 GPU cards.
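The paper's scheme is implemented inside the MPI runtime; the sketch below only illustrates the receiver-driven idea at the application level, with assumed packet sizes, window layout, and loss detection. The source exposes a retransmission buffer through an MPI window, and a receiver that detects a gap in sequence numbers fetches the missing packet itself with MPI_Get instead of asking the sender to retransmit.

```c
#include <mpi.h>

#define PKT_BYTES 4096     /* payload size per packet (assumed)      */
#define WINDOW_PKTS 256    /* packets retained for retransmission    */

/* Retransmission buffer kept by the broadcast source (rank 0 here).
 * Slot (seq % WINDOW_PKTS) holds the payload of packet `seq`. */
static char retransmit_buf[WINDOW_PKTS][PKT_BYTES];

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    const int source = 0;

    /* Every rank joins the window; only the source exposes memory. */
    MPI_Win win;
    MPI_Win_create(rank == source ? retransmit_buf : NULL,
                   rank == source ? (MPI_Aint)sizeof(retransmit_buf) : 0,
                   1 /* displacement unit: bytes */,
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    /* ... unreliable multicast stream runs here; each receiver tracks
     * the sequence numbers it has actually received ... */

    if (rank != source) {
        int missing_seq = 42;               /* detected gap (illustrative) */
        char pkt[PKT_BYTES];

        /* Receiver-driven recovery: pull the lost packet from the
         * source's retransmission window without involving the sender's
         * CPU in the retransmission. */
        MPI_Aint disp = (MPI_Aint)(missing_seq % WINDOW_PKTS) * PKT_BYTES;
        MPI_Win_lock(MPI_LOCK_SHARED, source, 0, win);
        MPI_Get(pkt, PKT_BYTES, MPI_BYTE, source, disp,
                PKT_BYTES, MPI_BYTE, win);
        MPI_Win_unlock(source, win);        /* completes the MPI_Get */

        (void)pkt;                          /* deliver to the application */
    }

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```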
{"title":"Efficient Reliability Support for Hardware Multicast-Based Broadcast in GPU-enabled Streaming Applications","authors":"Ching-Hsiang Chu, Khaled Hamidouche, H. Subramoni, Akshay Venkatesh, B. Elton, D. Panda","doi":"10.1109/COM-HPC.2016.9","DOIUrl":"https://doi.org/10.1109/COM-HPC.2016.9","url":null,"abstract":"Streaming applications, which are data-intensive, have been extensively run on High-Performance Computing (HPC) systems to seek the higher performance and scalability. These applications typically utilize broadcast operations to disseminate in real-time data from a single source to multiple workers, each being a multi-GPU based computing site. State-of-the-art broadcast operations take advantage of InfiniBand (IB) hardware multicast (MCAST) and NVIDIA GPUDirect features to boost inter-node communications performance and scalability. The IB MCAST feature works only with the IB Unreliable Datagram (UD) mechanism and consequently provides unreliable communication for applications. Higher-level libraries and/or runtime environments must handle and provide reliability explicitly. However, handling reliability at that level can be a performance bottleneck for streaming applications. In this paper, we analyze the specific requirements of streaming applications and the performance bottlenecks involved in handling reliability. We show that the traditional Negative-Acknowledgement (NACK) based approach requires the broadcast sender to perform retransmissions for lost packets, degrading streaming throughput. To alleviate this issue, we propose a novel Remote Memory Access (RMA) based scheme to provide high-performance reliability support at the MPI-level. In the proposed scheme, the receivers themselves (as opposed to the sender) retrieve lost packets through RMA operations. Furthermore, we provide an analytical model to illustrate the memory requirements of the proposed RMA-based scheme. Our experimental results show that the proposed scheme introduces nearly no overhead compared to the existing solutions. In a micro-benchmark with injected failures (to simulate unreliable network environments), the proposed scheme shows up to 45% reduction in latency compared to the existing NACK-based scheme. Moreover, with a synthetic streaming benchmark, our design also shows up to a 56% higher broadcast rate compared to the traditional NACK-based scheme on a GPU-dense Cray CS-Storm system with up to 88 NVIDIA K80 GPU cards.","PeriodicalId":332852,"journal":{"name":"2016 First International Workshop on Communication Optimizations in HPC (COMHPC)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125149864","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Extending a Message Passing Runtime to Support Partitioned, Global Logical Address Spaces
D. B. Larkins, James Dinan
DOI: 10.5555/3018058.3018060
Partitioned Global Address Space (PGAS) parallel programming models can provide an efficient mechanism for managing shared data stored across multiple nodes in a distributed memory system. However, these models are traditionally directly addressed and, for applications with loosely-structured or sparse data, determining the location of a given data element within a PGAS can incur significant overheads. Applications incur additional overhead from the network latency of lookups in remote location-resolution structures, and for large data, caching such structures locally incurs space and coherence overheads that can limit scaling. We observe that the matching structures used by implementations of the Message Passing Interface (MPI) establish a separation between incoming data writes and the location where data will be stored. In this work, we investigate extending such structures to add a layer of indirection between incoming data reads and the location from which data will be read, effectively extending PGAS models with logical addressing.
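The runtime extension itself is not shown in the abstract; the C sketch below only illustrates the indirection it argues for, using plain MPI one-sided operations and an assumed hash-based home-node directory (collision handling omitted). A reader resolves a logical key to a physical (owner rank, displacement) pair by reading the key's directory entry on its home node, then fetches the data, rather than computing the location directly from the key.

```c
#include <mpi.h>
#include <stdint.h>

/* Directory entry: where the element for a logical key actually lives.
 * The layout and the hash-based "home node" rule are assumptions made
 * for this illustration, not the paper's runtime design. */
typedef struct {
    int      owner;   /* rank holding the data                 */
    MPI_Aint disp;    /* byte displacement in the owner's window */
} DirEntry;

#define DIR_SLOTS 4096

static uint64_t key_hash(uint64_t key) { return key * 0x9E3779B97F4A7C15ULL; }

/* Resolve `key` through the directory window, then fetch `nbytes` of data
 * into `out` from wherever the entry points.  `dir_win` exposes a
 * DirEntry[DIR_SLOTS] array on every rank; `data_win` exposes the data. */
void logical_get(uint64_t key, void *out, int nbytes,
                 MPI_Win dir_win, MPI_Win data_win, int nprocs)
{
    /* Step 1: read the directory entry from the key's home node. */
    int home = (int)(key_hash(key) % (uint64_t)nprocs);
    MPI_Aint slot = (MPI_Aint)((key_hash(key) / (uint64_t)nprocs) % DIR_SLOTS);

    DirEntry e;
    MPI_Win_lock(MPI_LOCK_SHARED, home, 0, dir_win);
    MPI_Get(&e, (int)sizeof(DirEntry), MPI_BYTE, home,
            slot * (MPI_Aint)sizeof(DirEntry),
            (int)sizeof(DirEntry), MPI_BYTE, dir_win);
    MPI_Win_unlock(home, dir_win);

    /* Step 2: fetch the data itself from the owner recorded in the entry.
     * The extra round trip is the price of the logical indirection. */
    MPI_Win_lock(MPI_LOCK_SHARED, e.owner, 0, data_win);
    MPI_Get(out, nbytes, MPI_BYTE, e.owner, e.disp, nbytes, MPI_BYTE, data_win);
    MPI_Win_unlock(e.owner, data_win);
}
```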
{"title":"Extending a Message Passing Runtime to Support Partitioned, Global Logical Address Spaces","authors":"D. B. Larkins, James Dinan","doi":"10.5555/3018058.3018060","DOIUrl":"https://doi.org/10.5555/3018058.3018060","url":null,"abstract":"Partitioned Global Address Space (PGAS) parallel programming models can provide an efficient mechanism for managing shared data stored across multiple nodes in a distributed memory system. However, these models are traditionally directly addressed and, for applications with loosely-structured or sparse data, determining the location of a given data element within a PGAS can incur significant overheads. Applications incur additional overhead from the network latency of lookups from remote location resolution structures. Further, for large data, caching such structures locally incurs space and coherence overheads that can limit scaling. We observe that the matching structures used by implementations of the Message Passing Interface (MPI) establish a separation between incoming data writes and the location where data will be stored. In this work, we investigate extending such structures to add a layer of indirection between incoming data reads and the location from which data will be read, effectively extending PGAS models with logical addressing.","PeriodicalId":332852,"journal":{"name":"2016 First International Workshop on Communication Optimizations in HPC (COMHPC)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128806517","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DISP: Optimizations towards Scalable MPI Startup
Huansong Fu, S. Pophale, Manjunath Gorentla Venkata, Weikuan Yu
DOI: 10.1109/COM-HPC.2016.11
Despite the popularity of MPI for high-performance computing, the startup of MPI programs faces a scalability challenge, as both execution time and memory consumption increase drastically at scale. We have examined this problem using the Cheetah and Tuned collective modules in Open MPI as representative implementations. Previous improvements for collectives have focused on algorithmic advances and hardware offload. In this paper, we examine the startup cost of the collective module within a communicator and explore various techniques to improve its efficiency and scalability. Accordingly, we have developed a new scalable startup scheme with three internal techniques: Delayed Initialization, Module Sharing, and Prediction-based Topology Setup (DISP). Our DISP scheme greatly benefits the collective initialization of the Cheetah module and, at the same time, helps boost the performance of non-collective initialization in the Tuned module. We evaluate the performance of our implementation on the Titan supercomputer at ORNL with up to 4096 processes. The results show that our delayed initialization speeds up the startup of Tuned and Cheetah by an average of 32.0% and 29.2%, respectively; our module sharing reduces the memory consumption of Tuned and Cheetah by up to 24.1% and 83.5%, respectively; and our prediction-based topology setup speeds up the startup of Cheetah by up to 80%.
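Open MPI's actual component interfaces are not shown here; the C sketch below only illustrates the delayed-initialization idea in generic terms under assumed names: the expensive per-communicator setup (topology trees, peer tables) is deferred until the first collective actually needs it, so communicators that never invoke the module pay almost nothing at creation time.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Per-communicator state of a hypothetical collective module. */
typedef struct {
    bool  initialized;     /* has the expensive setup run yet?   */
    int  *tree_parent;     /* reduction-tree layout, built lazily */
    int  *tree_children;
    int   nprocs;
} coll_module_t;

/* Cheap constructor: called at communicator creation.  Nothing that
 * scales with the job size happens here. */
coll_module_t *coll_module_create(int nprocs)
{
    coll_module_t *m = calloc(1, sizeof(*m));
    m->nprocs = nprocs;
    return m;
}

/* Expensive setup: build topology structures.  Deferred until the first
 * collective call on this communicator. */
static void coll_module_lazy_init(coll_module_t *m)
{
    if (m->initialized) return;
    m->tree_parent   = malloc(m->nprocs * sizeof(int));
    m->tree_children = malloc(m->nprocs * sizeof(int));
    for (int r = 0; r < m->nprocs; r++) {
        m->tree_parent[r]   = (r - 1) / 2;   /* binary tree, illustrative */
        m->tree_children[r] = 2 * r + 1;
    }
    m->initialized = true;
}

/* Every collective entry point triggers the lazy setup on first use. */
void coll_module_allreduce(coll_module_t *m /*, buffers, op, ... */)
{
    coll_module_lazy_init(m);
    /* ... perform the reduction over the tree built above ... */
}
```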
{"title":"DISP: Optimizations towards Scalable MPI Startup","authors":"Huansong Fu, S. Pophale, Manjunath Gorentla Venkata, Weikuan Yu","doi":"10.1109/COM-HPC.2016.11","DOIUrl":"https://doi.org/10.1109/COM-HPC.2016.11","url":null,"abstract":"Despite the popularity of MPI for high performance computing, the startup of MPI programs faces a scalability challenge as both the execution time and memory consumption increase drastically at scale. We have examined this problem using the collective modules of Cheetah and Tuned in Open MPI as representative implementations. Previous improvements for collectives have focused on algorithmic advances and hardware off-load. In this paper, we examine the startup cost of the collective module within a communicator and explore various techniques to improve its efficiency and scalability. Accordingly, we have developed a new scalable startup scheme with three internal techniques, namely Delayed Initialization, Module Sharing and Prediction-based Topology Setup (DISP). Our DISP scheme greatly benefits the collective initialization of the Cheetah module. At the same time, it helps boost the performance of non-collective initialization in the Tuned module. We evaluate the performance of our implementation on Titan supercomputer at ORNL with up to 4096 processes. The results show that our delayed initialization can speed up the startup of Tuned and Cheetah by an average of 32.0% and 29.2%, respectively, our module sharing can reduce the memory consumption of Tuned and Cheetah by up to 24.1% and 83.5%, respectively, and our prediction-based topology setup can speed up the startup of Cheetah by up to 80%.","PeriodicalId":332852,"journal":{"name":"2016 First International Workshop on Communication Optimizations in HPC (COMHPC)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116594703","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}