
Proceedings of the 27th European MPI Users' Group Meeting: Latest Publications

Fibers are not (P)Threads: The Case for Loose Coupling of Asynchronous Programming Models and MPI Through Continuations
Pub Date : 2020-09-21 DOI: 10.1145/3416315.3416320
Joseph Schuchart, Christoph Niethammer, J. Gracia
Asynchronous programming models (APM) are gaining more and more traction, allowing applications to expose the available concurrency to a runtime system tasked with coordinating the execution. While MPI has long provided support for multi-threaded communication and non-blocking operations, it falls short of adequately supporting APMs as correctly and efficiently handling MPI communication in different models is still a challenge. Meanwhile, new low-level implementations of light-weight, cooperatively scheduled execution contexts (fibers, aka user-level threads (ULT)) are meant to serve as a basis for higher-level APMs and their integration in MPI implementations has been proposed as a replacement for traditional POSIX thread support to alleviate these challenges. In this paper, we first establish a taxonomy in an attempt to clearly distinguish different concepts in the parallel software stack. We argue that the proposed tight integration of fiber implementations with MPI is neither warranted nor beneficial and instead is detrimental to the goal of MPI being a portable communication abstraction. We propose MPI Continuations as an extension to the MPI standard to provide callback-based notifications on completed operations, leading to a clear separation of concerns by providing a loose coupling mechanism between MPI and APMs. We show that this interface is flexible and interacts well with different APMs, namely OpenMP detached tasks, OmpSs-2, and Argobots.
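As an illustration of the callback style the abstract argues for, here is a minimal sketch in C. The helper continue_when_complete() is a hypothetical stand-in (implemented here as a blocking fallback so the example compiles), not the MPI Continuations interface proposed in the paper; in the proposed extension, the callback would be invoked by the MPI progress engine without blocking the calling task.

```c
#include <mpi.h>
#include <stdio.h>

/* Hypothetical stand-in for a continuation-attach call: it blocks and then
 * runs the callback so the sketch is self-contained; the proposed extension
 * would instead run the callback when progress detects completion. */
static void continue_when_complete(MPI_Request *req,
                                   void (*cb)(void *), void *user_data)
{
    MPI_Wait(req, MPI_STATUS_IGNORE);   /* placeholder only */
    cb(user_data);
}

/* Continuation: runs once the receive has completed. */
static void on_complete(void *user_data)
{
    double *buf = user_data;
    printf("value %g arrived, dependent task may now run\n", buf[0]);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf = 42.0;
    MPI_Request req;
    if (rank == 0) {
        MPI_Irecv(&buf, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
        continue_when_complete(&req, on_complete, &buf);  /* no MPI_Wait in user code */
    } else if (rank == 1) {
        MPI_Send(&buf, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```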
Citations: 3
MPI Detach - Asynchronous Local Completion
Pub Date : 2020-09-21 DOI: 10.1145/3416315.3416323
Joachim Protze, Marc-André Hermanns, A. C. Demiralp, Matthias S. Müller, T. Kuhlen
When aiming for large-scale parallel computing, waiting times due to network latency, synchronization, and load imbalance are the primary obstacles to high parallel efficiency. A common approach to hide latency with computation is the use of non-blocking communication. In the presence of a consistent load imbalance, synchronization cost is just the visible symptom of the load imbalance. Tasking approaches such as those in OpenMP, TBB, OmpSs, or C++20 coroutines promise to expose a higher degree of concurrency, which can be distributed over available execution units and significantly improve load balance. Available MPI non-blocking functionality does not integrate seamlessly into such tasking parallelization. In this work, we present a slim extension of the MPI interface that allows seamless integration of non-blocking communication with the available concepts of asynchronous execution in OpenMP and C++.
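The sketch below illustrates, under stated assumptions, how such an extension could pair with OpenMP detached tasks. The helper detach_on_completion() is a hypothetical placeholder (a blocking fallback so the example is self-contained), not the interface proposed in the paper; a real extension would signal local completion asynchronously. Calling MPI from tasks like this also assumes the library was initialized with MPI_THREAD_MULTIPLE.

```c
#include <mpi.h>
#include <omp.h>

/* Hypothetical stand-in for the proposed detach call: a real extension would
 * return immediately and signal local completion asynchronously (e.g., via a
 * user callback); this placeholder blocks so the sketch is self-contained. */
static void detach_on_completion(MPI_Request *req, omp_event_handle_t ev)
{
    MPI_Wait(req, MPI_STATUS_IGNORE);   /* placeholder for async completion */
    omp_fulfill_event(ev);              /* completes the detached task */
}

/* Post a send from a detached OpenMP task: the task only completes (and
 * releases its dependences) once the MPI operation has completed locally. */
void send_async(double *buf, int count, int dest, MPI_Comm comm)
{
    omp_event_handle_t ev;              /* handle is captured at task creation */
    #pragma omp task detach(ev)
    {
        MPI_Request req;
        MPI_Isend(buf, count, MPI_DOUBLE, dest, 0, comm, &req);
        detach_on_completion(&req, ev);
    }
}
```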
Citations: 7
Communication and Timing Issues with MPI Virtualization
Pub Date : 2020-09-21 DOI: 10.1145/3416315.3416317
Alexandr Nigay, L. Mosimann, Timo Schneider, T. Hoefler
Computation–communication overlap and good load balance are features central to the high performance of parallel programs. Unfortunately, achieving them with MPI requires considerably increasing the complexity of user code. Our work contributes to an alternative solution to this problem: using a virtualized MPI implementation. Virtualized MPI implementations diverge from traditional MPI implementations in that they map MPI processes to user-level threads instead of operating-system processes and launch more of them than there are CPU cores in the system. They are capable of providing automatic computation–communication overlap and load balance with little to no changes to pre-existing MPI user code. Our work has uncovered new insights into MPI virtualization: two new kinds of timers are needed, an MPI-process timer and a CPU-core timer; the same discussion also applies to performance counters and the MPI profiling interface. We also observe an interplay between the degree of CPU oversubscription and the rendezvous communication protocol: we find that the intuitive expectation that two MPI processes per CPU core are enough to achieve full computation–communication overlap is wrong for the rendezvous protocol; instead, three MPI processes per CPU core are required in that case. Our findings are expected to be applicable to all virtualized MPI implementations as well as to general tasking runtime systems.
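A rough illustration of the two timer kinds mentioned above, assuming each MPI process maps to its own OS thread: MPI_Wtime gives the wall-clock (MPI-process) view, while the thread CPU clock approximates the CPU time actually consumed. In a virtualized implementation the runtime itself would have to supply the per-virtual-process timer; this sketch only shows why the two views diverge under oversubscription.

```c
#include <mpi.h>
#include <stdio.h>
#include <time.h>

/* CPU time consumed by the calling thread, in seconds. */
static double thread_cpu_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    double wall0 = MPI_Wtime();
    double cpu0  = thread_cpu_seconds();

    /* Phase being measured: stand-in busy work. */
    volatile double x = 0.0;
    for (long i = 0; i < 10000000L; ++i) x += 1e-9;

    double wall = MPI_Wtime() - wall0;
    double cpu  = thread_cpu_seconds() - cpu0;

    /* With more MPI processes than cores, wall time grows with the degree of
     * oversubscription while the consumed CPU time stays roughly constant. */
    printf("wall-clock: %.6f s, thread CPU time: %.6f s\n", wall, cpu);

    MPI_Finalize();
    return 0;
}
```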
Citations: 1
Using Advanced Vector Extensions AVX-512 for MPI Reductions
Pub Date : 2020-09-21 DOI: 10.1145/3416315.3416316
Dong Zhong, Qinglei Cao, G. Bosilca, J. Dongarra
As the scale of high-performance computing (HPC) systems continues to grow, researchers devote themselves to exploring increasing levels of parallelism to achieve optimal performance. The modern CPU’s design, including its features of hierarchical memory and SIMD/vectorization capability, governs algorithms’ efficiency. The recent introduction of wide vector instruction set extensions (AVX and SVE) has made vectorization critically important for increasing efficiency and closing the gap to peak performance. In this paper, we propose an implementation of predefined MPI reduction operations utilizing AVX, AVX2 and AVX-512 intrinsics to provide vector-based reduction operations and to improve the time-to-solution of these predefined MPI reduction operations. With these optimizations, we achieve higher efficiency for local computations, which directly benefits the overall cost of collective reductions. The evaluation of the resulting software stack under different scenarios demonstrates that the solution is at the same time generic and efficient. Experiments are conducted on an Intel Xeon Gold cluster and show that our AVX-512-optimized reduction operations achieve up to a 10X performance benefit over the Open MPI default for local reductions.
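A minimal sketch of the kind of vectorized local reduction step described above, for MPI_SUM on doubles with AVX-512 intrinsics; it is not the authors' implementation and must be compiled with AVX-512F support (e.g., -mavx512f).

```c
#include <immintrin.h>
#include <stddef.h>

/* inout[i] += in[i] for count doubles, processed 8 at a time. */
void sum_doubles_avx512(double *inout, const double *in, size_t count)
{
    size_t i = 0;
    for (; i + 8 <= count; i += 8) {
        __m512d a = _mm512_loadu_pd(&inout[i]);
        __m512d b = _mm512_loadu_pd(&in[i]);
        _mm512_storeu_pd(&inout[i], _mm512_add_pd(a, b));
    }
    for (; i < count; ++i)          /* scalar tail */
        inout[i] += in[i];
}
```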
Citations: 7
Signature Datatypes for Type Correct Collective Operations, Revisited
Pub Date : 2020-09-21 DOI: 10.1145/3416315.3416324
J. Träff
In order to provide for type correct implementations of applications in MPI that use derived datatypes to describe complex and possibly heterogeneous data layouts, signature datatypes describing the sequence of basic datatypes comprising the complex data layout in a compact manner have often been proposed and used to communicate and store such data in a type correct way. Signature datatypes are particularly useful in implementations of algorithms for collective communication employing pipelining and/or message-combining. We (re)examine the properties that signature datatypes must fulfill, and the properties of the MPI collective interfaces that guarantee the existence of proper signature datatypes. The analysis reveals that MPI_Alltoallw does not have the property, and thus that certain non-trivial, type correct implementations of this operation are not easily possible within MPI itself. We observe that the signature datatype for any derived datatype can be computed in O(n) operations, where n is the number of elements described by the derived datatype. While this improves on certain earlier approaches, this is still not a satisfactory solution for the cases where large layouts are described by small, derived datatypes. We explain how signature type computation is implemented in a library for advanced datatype programming.
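A small sketch of how a signature can be recovered from a derived datatype using the standard MPI introspection calls; only the named and contiguous combiners are handled here, whereas a complete flattener (and the paper's O(n) construction) must cover every combiner.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Print 'reps' copies of the signature (flat basic-type sequence) of 'type'. */
static void print_signature(MPI_Datatype type, long reps)
{
    int ni, na, nd, combiner;
    MPI_Type_get_envelope(type, &ni, &na, &nd, &combiner);

    if (combiner == MPI_COMBINER_NAMED) {          /* a basic type: leaf */
        int len;
        char name[MPI_MAX_OBJECT_NAME];
        MPI_Type_get_name(type, name, &len);
        printf("%ld x %s\n", reps, name);
        return;
    }

    int *ints = malloc(ni * sizeof(int));
    MPI_Aint *addrs = malloc(na * sizeof(MPI_Aint));
    MPI_Datatype *types = malloc(nd * sizeof(MPI_Datatype));
    MPI_Type_get_contents(type, ni, na, nd, ints, addrs, types);

    if (combiner == MPI_COMBINER_CONTIGUOUS)       /* contents: count, oldtype */
        print_signature(types[0], reps * ints[0]);
    else
        printf("(combiner %d not handled in this sketch)\n", combiner);

    /* A full implementation would also free non-named types returned here. */
    free(ints); free(addrs); free(types);
}
```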
Citations: 1
Evaluating MPI Message Size Summary Statistics
Pub Date : 2020-09-21 DOI: 10.1145/3416315.3416322
Kurt B. Ferreira, Scott Levy
The Message Passing Interface (MPI) remains the dominant programming model for scientific applications running on today’s high-performance computing (HPC) systems. This dominance stems from MPI’s powerful semantics for inter-process communication, which have enabled scientists to write applications for simulating important physical phenomena. MPI does not, however, specify how messages and synchronization should be carried out. Those details are typically dependent on low-level architecture details and the message characteristics of the application. Therefore, analyzing an application’s MPI usage is critical to tuning MPI’s performance on a particular platform. The result of this analysis is typically a discussion of average message sizes for a workload or set of workloads. While the message average might be the most intuitive summary statistic, it might not be the most useful in terms of representing the entire message-size dataset for an application. Using a previously developed MPI trace collector, we analyze the MPI message traces for a number of key MPI workloads. Through this analysis, we demonstrate that the average, while easy and efficient to calculate, may not be a good representation of all subsets of application message sizes, with the median and mode of message sizes being a superior choice in most cases. We show that the problem with using the average relates to the multi-modal nature of the distribution of point-to-point messages. Finally, we show that while scaling a workload has little discernible impact on which measures of central tendency are representative of the underlying data, different input descriptions can significantly impact which metric is most effective. The results and analysis in this paper have the potential to provide valuable guidance on how we as a community should discuss and analyze MPI message data for scientific applications.
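A toy illustration (with made-up sizes) of why the mean can misrepresent a bimodal message-size distribution while the median tracks the dominant mode:

```c
#include <stdio.h>
#include <stdlib.h>

static int cmp_sz(const void *a, const void *b)
{
    size_t x = *(const size_t *)a, y = *(const size_t *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    /* Eight small control messages and two large bulk transfers (bytes). */
    size_t sizes[] = { 8, 8, 8, 16, 16, 16, 32, 32, 4194304, 8388608 };
    size_t n = sizeof sizes / sizeof sizes[0];

    double mean = 0.0;
    for (size_t i = 0; i < n; ++i) mean += (double)sizes[i] / n;

    qsort(sizes, n, sizeof sizes[0], cmp_sz);
    double median = (n % 2) ? (double)sizes[n / 2]
                            : (sizes[n / 2 - 1] + sizes[n / 2]) / 2.0;

    /* Mean is about 1.26 MB, median is 16 bytes: the average describes
     * neither the many small messages nor the few large ones. */
    printf("mean = %.0f bytes, median = %.0f bytes\n", mean, median);
    return 0;
}
```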
Citations: 3
Collectives and Communicators: A Case for Orthogonality: (Or: How to get rid of MPI neighbor and enhance Cartesian collectives)
Pub Date : 2020-09-21 DOI: 10.1145/3416315.3416319
J. Träff, S. Hunold, Guillaume Mercier, Daniel J. Holmes
A major reason for the success of MPI as the standard for large-scale, distributed memory programming is the economy and orthogonality of its key concepts. These very design principles suggest leaner and better support for stencil-like, sparse collective communication, while at the same time reducing significantly the number of concrete operation interfaces, extending the functionality that can be supported by high-quality MPI implementations, and provisioning for possible future, much more wide-ranging functionality. As a starting point for discussion, we suggest (re)defining communicators as the sole carriers of the topological structure over processes that determines the semantics of the collective operations, and limiting the functions that can associate topological information with communicators to the functions for distributed graph topology and inter-communicator creation. As a consequence, one set of interfaces for collective communication operations (in blocking, non-blocking, and persistent variants) will suffice, explicitly eliminating the MPI_Neighbor_ interfaces (in all variants) from the MPI standard. Topological structure will not be implied by Cartesian communicators, which in turn will have the sole function of naming processes in a (d-dimensional, Euclidean) geometric space. The geometric naming can be passed to the topology-creating functions as part of the communicator, and be used for process reordering and topological collective algorithm selection. Concretely, at the price of only one essential additional function, our suggestion can remove 10(+1) function interfaces from MPI-3, and 15 (or more) from MPI-4, while providing vastly more optimization scope for the MPI library implementation.
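For reference, the sketch below shows how the sparse pattern is expressed in current MPI: the neighborhood is attached to a communicator via MPI_Dist_graph_create_adjacent and then used by a neighborhood collective. The paper's proposal is that ordinary collectives could take over this role once the communicator itself carries the topology.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Ring neighborhood: receive from left and right, send to left and right. */
    int left  = (rank - 1 + size) % size;
    int right = (rank + 1) % size;
    int sources[2]      = { left, right };
    int destinations[2] = { left, right };

    MPI_Comm ring;
    MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD,
                                   2, sources,      MPI_UNWEIGHTED,
                                   2, destinations, MPI_UNWEIGHTED,
                                   MPI_INFO_NULL, 0 /* no reorder */, &ring);

    /* Exchange one int with each neighbor only, not with the whole group. */
    int sendval = rank, recvvals[2];
    MPI_Neighbor_allgather(&sendval, 1, MPI_INT, recvvals, 1, MPI_INT, ring);
    printf("rank %d received %d and %d from its neighbors\n",
           rank, recvvals[0], recvvals[1]);

    MPI_Comm_free(&ring);
    MPI_Finalize();
    return 0;
}
```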
Citations: 1
Why is MPI (perceived to be) so complex?: Part 1—Does strong progress simplify MPI?
Pub Date : 2020-09-21 DOI: 10.1145/3416315.3416318
Daniel J. Holmes, A. Skjellum, Derek Schafer
Strong progress is optional in MPI. MPI allows implementations where progress (for example, updating the message-transport state machines or interacting with network devices) is only made during certain MPI procedure calls. Generally speaking, strong progress implies the ability to achieve progress (to transport data through the network from senders to receivers and exchange protocol messages) without explicit calls from user processes to MPI procedures. For instance, data given to a send procedure that matches a pre-posted receive on the receiving process is moved from source to destination in due course, regardless of how often (including zero times) the sender or receiver processes call MPI in the meantime. Further, nonblocking operations and persistent collective operations work ‘in the background’ of user processes once all processes in the communicator’s group have performed the starting step for the operation. Overall, strong progress is meant to enhance the potential for overlap of communication and computation and to improve the predictability of procedure execution times by eliminating progress effort from user threads. This paper posits that strong progress is desirable as an MPI implementation property and examines its implications; it explores such possibilities and sets forth principles that underpin MPI and its interactions with normal and fault modes of operation. The key contribution of this paper is the conclusion that, whether measured by absolute performance, by performance portability, or by interface simplicity, strong progress in MPI is no worse than weak progress and, in most scenarios, has more potential to fulfil the aforementioned desirable attributes.
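A small sketch of the overlap pattern at stake, with a stand-in compute_phase() workload: under strong progress the transfer can advance while the computation runs, whereas under weak progress the MPI_Test calls in the loop are the usual workaround to drive the progress engine.

```c
#include <mpi.h>

/* Stand-in for application work between posting and completing the send. */
static void compute_phase(int step)
{
    volatile double x = 0.0;
    for (int i = 0; i < 100000; ++i) x += step * 1e-9;
}

void overlapped_send(const double *buf, int count, int dest, MPI_Comm comm)
{
    MPI_Request req;
    MPI_Isend(buf, count, MPI_DOUBLE, dest, 0, comm, &req);

    int done = 0;
    for (int step = 0; step < 100; ++step) {
        compute_phase(step);
        /* Needed only as a workaround under weak progress: each MPI_Test call
         * gives the library a chance to advance the transfer. */
        if (!done)
            MPI_Test(&req, &done, MPI_STATUS_IGNORE);
    }
    if (!done)
        MPI_Wait(&req, MPI_STATUS_IGNORE);
}
```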
Citations: 1
Implementation and performance evaluation of MPI persistent collectives in MPC: a case study
Pub Date : 2020-09-21 DOI: 10.1145/3416315.3416321
Stéphane Bouhrour, Julien Jaeger
Persistent collective communications have recently been voted into the MPI standard, opening the door to many optimizations to reduce collective cost, in particular for recurring operations. Indeed, the persistent semantics include an initialization phase that is called only once for a specific collective. It can be used to pay the setup costs of the collective once, avoiding paying them each time the operation is performed. We give an overview of the implementation of the persistent collectives in the MPC MPI runtime. We first present a naïve implementation for MPI runtimes already providing nonblocking collectives. Then, we improve this first implementation with two levels of caching optimizations. We present the performance results of the naïve and optimized versions and discuss their impact on different collective algorithms. We observe performance improvements compared to the naïve version on a repetitive benchmark, up to a 3x speedup for the reduce collective.
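A sketch of the persistent-collective usage pattern, assuming an MPI library that implements the MPI 4.0 persistent collectives (MPI_Allreduce_init): the setup cost is paid once at initialization, and each iteration only starts and completes the prebuilt operation.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = rank, global = 0.0;
    MPI_Request req;

    /* One-time initialization: algorithm selection, schedule construction,
     * buffer registration, etc. can all be cached here. */
    MPI_Allreduce_init(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                       MPI_COMM_WORLD, MPI_INFO_NULL, &req);

    for (int iter = 0; iter < 10; ++iter) {
        local += 1.0;                 /* update this rank's contribution */
        MPI_Start(&req);              /* start one round of the collective */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        if (rank == 0)
            printf("iter %d: global sum = %g\n", iter, global);
    }

    MPI_Request_free(&req);           /* release the persistent operation */
    MPI_Finalize();
    return 0;
}
```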
Citations: 1