Efficient Process Arrival Pattern Aware Collective Communication for Deep Learning
Pedram Alizadeh, A. Sojoodi, Yiltan Hassan Temuçin, A. Afsahi
MPI collective communication operations are used extensively in parallel applications. As such, researchers have been investigating how to improve their performance and scalability to directly impact application performance. Unfortunately, most of these studies are based on the premise that all processes arrive at the collective call simultaneously. A few studies, though, have shown that an imbalanced Process Arrival Pattern (PAP) is ubiquitous in real environments and significantly affects collective performance. Therefore, devising PAP-aware collective algorithms that could improve performance, while challenging, is highly desirable. This paper follows that line of work, but in the context of Deep Learning (DL) workloads, which have become mainstream. It first presents a brief characterization of collective communications, in particular MPI_Allreduce, in the Horovod distributed Deep Learning framework and shows that the arrival pattern of MPI processes is indeed imbalanced. It then proposes an intra-node shared-memory PAP-aware MPI_Allreduce algorithm for small to medium messages, in which the leader process is dynamically chosen based on the arrival time of the processes at each invocation of the collective call. We then propose an intra-node PAP-aware algorithm for large messages that dynamically constructs the reduction schedule at each MPI_Allreduce invocation. Finally, we propose a PAP-aware cluster-wide hierarchical algorithm that builds on our intra-node PAP-aware designs and, given its hierarchical nature, imposes less data dependency among processes than flat algorithms. The proposed algorithms deliver up to 58% improvement over the native algorithms at the micro-benchmark level and up to 17% improvement for Horovod with TensorFlow.
{"title":"Efficient Process Arrival Pattern Aware Collective Communication for Deep Learning","authors":"Pedram Alizadeh, A. Sojoodi, Yiltan Hassan Temuçin, A. Afsahi","doi":"10.1145/3555819.3555857","DOIUrl":"https://doi.org/10.1145/3555819.3555857","url":null,"abstract":"MPI collective communication operations are used extensively in parallel applications. As such, researchers have been investigating how to improve their performance and scalability to directly impact application performance. Unfortunately, most of these studies are based on the premise that all processes arrive at the collective call simultaneously. A few studies though have shown that imbalanced Process Arrival Pattern (PAP) is ubiquitous in real environments, significantly affecting the collective performance. Therefore, devising PAP-aware collective algorithms that could improve performance, while challenging, is highly desirable. This paper is along those lines but in the context of Deep Learning (DL) workloads that have become maintstream. This paper presents a brief characterization of collective communications, in particular MPI_Allreduce, in the Horovod distributed Deep Learning framework and shows that the arrival pattern of MPI processes is indeed imbalanced. It then proposes an intra-node shared-memory PAP-aware MPI_Allreduce algorithm for small to medium messages, where the leader process is dynamically chosen based on the arrival time of the processes at each invocation of the collective call. We then propose an intra-node PAP-aware algorithm for large messages that dynamically constructs the reduction schedule at each MPI_Allreduce invocation. Finally, we propose a PAP-aware cluster-wide hierarchical algorithm, which is extended by utilizing our intra-node PAP-aware designs, that imposes less data dependency among processes given its hierarchical nature compared to flat algorithms. The proposed algorithms deliver up to 58% and 17% improvement at the micro-benchmark and Horovod with TensorFlow application over the native algorithms, respectively.","PeriodicalId":423846,"journal":{"name":"Proceedings of the 29th European MPI Users' Group Meeting","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125396281","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Applying on Node Aggregation Methods to MPI Alltoall Collectives: Matrix Block Aggregation Algorithm
G. Chochia, David G. Solt, Joshua Hursey
This paper presents algorithms for the all-to-all and all-to-all(v) MPI collectives optimized for small to medium messages and large per-node task counts, targeting multicore CPUs in HPC systems. The complexity of these algorithms is analyzed for two metrics: the number of messages and the volume of data exchanged per task. The algorithms achieve optimal complexity for the second metric, improving on algorithms designed for short messages by a logarithmic factor, while retaining logarithmic complexity for the first metric. It is shown that the balance between these two metrics is key to achieving optimal performance. The performance advantage of the new algorithm is demonstrated at scale by comparing against logarithmic algorithm implementations in Open MPI and Spectrum MPI. A two-phase design for the all-to-all(v) algorithm is presented; it combines efficient handling of short and large messages in a single framework, which is a known difficulty for logarithmic all-to-all(v) algorithms.
{"title":"Applying on Node Aggregation Methods to MPI Alltoall Collectives: Matrix Block Aggregation Algorithm","authors":"G. Chochia, David G. Solt, Joshua Hursey","doi":"10.1145/3555819.3555821","DOIUrl":"https://doi.org/10.1145/3555819.3555821","url":null,"abstract":"This paper presents algorithms for all-to-all and all-to-all(v) MPI collectives optimized for small-medium messages and large task counts per node to support multicore CPUs in HPC systems. The complexity of these algorithms is analyzed for two metrics: the number of messages and the volume of data exchanged per task. These algorithms have optimal complexity for the second metric, which is better by a logarithmic factor than that in algorithms designed for short messages, with logarithmic complexity for the first metric. It is shown that the balance between these two metrics is key to achieving optimal performance. The performance advantage of the new algorithm is demonstrated at scale by comparing performance versus logarithmic algorithm implementations in Open MPI and Spectrum MPI. The two-phase design for the all-to-all(v) algorithm is presented. It combines efficient implementations for short and large messages in a single framework which is known to be an issue in logarithmic all-to-all(v) algorithms.","PeriodicalId":423846,"journal":{"name":"Proceedings of the 29th European MPI Users' Group Meeting","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115777834","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Enabling Global MPI Process Addressing in MPI Applications
Jean-Baptiste Besnard, S. Shende, A. Malony, Julien Jaeger, Marc Pérache
Distributed software using MPI is now facing a complexity barrier. Indeed, given increasing intra-node parallelism combined with the use of accelerators, programs’ states are becoming more intricate. A given code must cover several cases, generating work for multiple devices. Model mixing generally leads to increasingly large programs and hinders performance portability. In this paper, we pose the question of software composition, splitting jobs into multiple services. In doing so, we argue that it would become possible to rely on more suitable units while removing the need for extensive runtime stacking (MPI+X+Y). For this purpose, we discuss what MPI should provide, and what is currently available, to enable such software composition. After pinpointing (1) process discovery and (2) Remote Procedure Calls (RPCs) as facilitators in such an infrastructure, we focus solely on the first aspect. We introduce an overlay network providing whole-machine inter-job discovery and wiring at the level of the MPI runtime. MPI process Unique IDentifiers (UIDs) are then treated as Unique Resource Locators (URLs) that support job interaction in MPI, enabling a more horizontal use of the MPI interface. Finally, we present performance results for large-scale wiring-up exchanges, demonstrating gains over PMIx in cross-job configurations.
{"title":"Enabling Global MPI Process Addressing in MPI Applications","authors":"Jean-Baptiste Besnard, S. Shende, A. Malony, Julien Jaeger, Marc Pérache","doi":"10.1145/3555819.3555829","DOIUrl":"https://doi.org/10.1145/3555819.3555829","url":null,"abstract":"Distributed software using MPI is now facing a complexity barrier. Indeed, given increasing intra-node parallelism, combined with the use of accelerator, programs’ states are becoming more intricate. A given code must cover several cases, generating work for multiple devices. Model mixing generally leads to increasingly large programs and hinders performance portability. In this paper, we pose the question of software composition, trying to split jobs in multiple services. In doing so, we advocate it would be possible to depend on more suitable units while removing the need for extensive runtime stacking (MPI+X+Y). For this purpose, we discuss what MPI shall provide and what is currently available to enable such software composition. After pinpointing (1) process discovery and (2) Remote Procedure Calls (RPCs) as facilitators in such infrastructure, we focus solely on the first aspect. We introduce an overlay-network providing whole-machine inter-job, discovery, and wiring at the level of the MPI runtime. MPI process Unique IDentifiers (UIDs) are then covered as a Unique Resource Locator (URL) leveraged as support for job interaction in MPI, enabling a more horizontal usage of the MPI interface. Eventually, we present performance results for large-scale wiring-up exchanges, demonstrating gains over PMIx in cross-job configurations.","PeriodicalId":423846,"journal":{"name":"Proceedings of the 29th European MPI Users' Group Meeting","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128646856","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Towards Dynamic Resource Management with MPI Sessions and PMIx
Dominik Huber, Maximilian Streubel, Isaías Comprés, M. Schulz, Martin Schreiber, H. Pritchard
Job management software on peta- and exascale supercomputers continues to provide static resource allocations, from a program’s start until its end. Dynamic resource allocation and management is a research direction with the potential to improve the efficiency of HPC systems and applications by adapting an application’s resources during its runtime. Resources can be adapted based on past, current, or even future system conditions and matching optimization targets. However, implementing dynamic resource management is challenging, as it requires support across many layers of the software stack, including the programming model. In this paper, we focus on the latter and present our approach to extending MPI Sessions to support dynamic resource allocations within MPI applications. While some forms of dynamicity already exist in MPI, they are currently limited by requiring global synchronization, by being application or application-domain specific, or by the limited support in current HPC system software stacks. We overcome these limitations with a simple yet powerful abstraction: resources as process sets, and changes of resources as set operations, leading to a graph-based perspective on resource changes. As the main contribution of this work, we provide an implementation of this approach based on MPI Sessions and PMIx. In addition, we illustrate its usage and discuss the required extensions to the PMIx standard. We report results based on a prototype implementation with Open MPI using a synthetic application, as well as a PDE solver benchmark, on up to four nodes and a total of 112 cores. Overall, our results show the feasibility of our approach, which incurs only very moderate overheads. We see this first proof of concept as an important step towards resource adaptivity based on MPI Sessions.
{"title":"Towards Dynamic Resource Management with MPI Sessions and PMIx","authors":"Dominik Huber, Maximilian Streubel, Isaías Comprés, M. Schulz, Martin Schreiber, H. Pritchard","doi":"10.1145/3555819.3555856","DOIUrl":"https://doi.org/10.1145/3555819.3555856","url":null,"abstract":"Job management software on peta- and exascale supercomputers continues to provide static resource allocations, from a program’s start until its end. Dynamic resource allocation and management is a research direction that has the potential to improve the efficiency of HPC systems and applications by dynamically adapting the resources of an application during its runtime. Resources can be adapted based on past, current or even future system conditions and matching optimization targets. However, the implementation of dynamic resource management is challenging as it requires support across many layers of the software stack, including the programming model. In this paper, we focus on the latter and present our approach to extend MPI Sessions to support dynamic resource allocations within MPI applications. While some forms of dynamicity already exist in MPI, it is currently limited by requiring global synchronization, being application or application-domain specific, or by suffering from limited support in current HPC system software stacks. We overcome these limitations with a simple, yet powerful abstraction: resources as process sets, and changes of resources as set operations leading to a graph-based perspective on resource changes. As the main contribution of this work, we provide an implementation of this approach based on MPI Sessions and PMIx. In addition, an illustration of its usage is provided, as well as a discussion about the required extensions of the PMIx standard. We report results based on a prototype implementation with Open MPI using a synthetic application, as well as a PDE solver benchmark on up to four nodes and a total of 112 cores. Overall, our results show the feasibility of our approach, which has only very moderate overheads. We see this first proof-of-concept as an important step towards resource adaptivity based on MPI Sessions.","PeriodicalId":423846,"journal":{"name":"Proceedings of the 29th European MPI Users' Group Meeting","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132222497","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Distributed Acceleration of Adhesive Dynamics Simulations
Daniel F. Puleri, Aristotle X. Martin, A. Randles
Cell adhesion plays a critical role in processes ranging from leukocyte migration to cancer cell transport during metastasis. Adhesive cell interactions can occur over large distances in microvessel networks, with cells traveling distances much greater than their own diameter. Biologically relevant investigations therefore require efficient modeling of large field-of-view domains, but current models are limited because simulating such geometries at the sub-micron scale required to resolve adhesive interactions greatly increases the computational requirements, even for small domain sizes. In this study, we introduce a hybrid scheme that relies on both on-node and distributed parallelism to accelerate a fully deformable adhesive dynamics cell model. This scheme makes performant use of modern supercomputers with many-core per-node architectures. On-node acceleration is augmented by a combination of spatial data structures and algorithmic changes that lessen the need for atomic operations. The resulting deformable adhesive cell model, accelerated with hybrid parallelization, allows us to bridge the gap between high-resolution cell models, which can capture the sub-micron adhesive interactions between the cell and its microenvironment, and large-scale fluid-structure interaction (FSI) models, which can track cells over considerable distances. By integrating the sub-micron simulation environment into a distributed FSI simulation, we enable the study of previously infeasible research questions involving numerous adhesive cells in microvessel networks, such as cancer cell transport through the microcirculation.
{"title":"Distributed Acceleration of Adhesive Dynamics Simulations","authors":"Daniel F. Puleri, Aristotle X. Martin, A. Randles","doi":"10.1145/3555819.3555832","DOIUrl":"https://doi.org/10.1145/3555819.3555832","url":null,"abstract":"Cell adhesion plays a critical role in processes ranging from leukocyte migration to cancer cell transport during metastasis. Adhesive cell interactions can occur over large distances in microvessel networks with cells traveling over distances much greater than the length scale of their own diameter. Therefore, biologically relevant investigations necessitate efficient modeling of large field-of-view domains, but current models are limited by simulating such geometries at the sub-micron scale required to model adhesive interactions which greatly increases the computational requirements for even small domain sizes. In this study we introduce a hybrid scheme reliant on both on-node and distributed parallelism to accelerate a fully deformable adhesive dynamics cell model. This scheme leads to performant system usage of modern supercomputers which use a many-core per-node architecture. On-node acceleration is augmented by a combination of spatial data structures and algorithmic changes to lessen the need for atomic operations. This deformable adhesive cell model accelerated with hybrid parallelization allows us to bridge the gap between high-resolution cell models which can capture the sub-micron adhesive interactions between the cell and its microenvironment, and large-scale fluid-structure interaction (FSI) models which can track cells over considerable distances. By integrating the sub-micron simulation environment into a distributed FSI simulation we enable the study of previously unfeasible research questions involving numerous adhesive cells in microvessel networks such as cancer cell transport through the microcirculation.","PeriodicalId":423846,"journal":{"name":"Proceedings of the 29th European MPI Users' Group Meeting","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114341112","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Towards a Hybrid MPI Correctness Benchmark Suite
Tim Jammer, Alexander Hück, Jan-Patrick Lehr, Joachim Protze, Simon Schwitanski, Christian H. Bischof
High-performance computing codes often combine the Message-Passing Interface (MPI) with a shared-memory programming model, e.g., OpenMP, for efficient computations. These so-called hybrid models may issue MPI calls concurrently from different threads at the highest level of MPI thread support. The correct use of either MPI or OpenMP alone can be complex and error-prone; the hybrid model increases this complexity even further. While correctness analysis tools exist for both programming paradigms, hybrid models introduce a new set of potential errors whose detection requires combining knowledge of MPI and OpenMP primitives. Unfortunately, correctness tools do not fully support the hybrid model yet, and their current capabilities are also hard to assess. In previous work, to enable structured comparisons of correctness tools and improve their coverage, we proposed the MPI-CorrBench test suite for MPI. Likewise, others proposed the DataRaceBench test suite for OpenMP. However, no such test suite exists for the particular error classes of the hybrid model. Hence, we propose a hybrid MPI-OpenMP test suite to (1) facilitate correctness tool development in this area and, subsequently, (2) further encourage the use of the hybrid model at the highest level of MPI thread support. To that end, we discuss issues with this hybrid model and the MPI and OpenMP knowledge that correctness tools need to combine in order to detect them. In our evaluation of two state-of-the-art correctness tools, we see that for most cases of concurrent and conflicting MPI operations, these tools can cope with the added complexity of OpenMP. However, more intricate errors, where user code interferes with MPI, e.g., a data race on a buffer, still evade tool analysis.
{"title":"Towards a Hybrid MPI Correctness Benchmark Suite","authors":"Tim Jammer, Alexander Hück, Jan-Patrick Lehr, Joachim Protze, Simon Schwitanski, Christian H. Bischof","doi":"10.1145/3555819.3555853","DOIUrl":"https://doi.org/10.1145/3555819.3555853","url":null,"abstract":"High-performance computing codes often combine the Message-Passing Interface (MPI) with a shared-memory programming model, e.g., OpenMP, for efficient computations. These so-called hybrid models may issue MPI calls concurrently from different threads at the highest level of MPI thread support. The correct use of either MPI or OpenMP can be complex and error-prone. The hybrid model increases this complexity even further. While correctness analysis tools exist for both programming paradigms, for hybrid models, a new set of potential errors exist, whose detection requires combining knowledge of MPI and OpenMP primitives. Unfortunately, correctness tools do not fully support the hybrid model yet, and their current capabilities are also hard to assess. In previous work, to enable structured comparisons of correctness tools and improve their coverage, we proposed the MPI-CorrBench test suite for MPI. Likewise, others proposed the DataRaceBench test suite for OpenMP. However, for the particular error classes of the hybrid model, no such test suite exists. Hence, we propose a hybrid MPI-OpenMP test suite to (1) facilitate the correctness tool development in this area and, subsequently, (2) further encourage the use of the hybrid model at the highest level of MPI thread support. To that end, we discuss issues with this hybrid model and the knowledge correctness tools need to combine w.r.t. MPI and OpenMP to detect these. In our evaluation of two state-of-the-art correctness tools, we see that for most cases of concurrent and conflicting MPI operations, these tools can cope with the added complexity of OpenMP. However, more intricate errors, where user code interferes with MPI, e.g., a data race on a buffer, still evade tool analysis.","PeriodicalId":423846,"journal":{"name":"Proceedings of the 29th European MPI Users' Group Meeting","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130723441","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MPIX Stream: An Explicit Solution to Hybrid MPI+X Programming
Hui Zhou, Kenneth Raffenetti, Yan-Hua Guo, R. Thakur
The hybrid MPI+X programming paradigm, where X refers to threads or GPUs, has gained prominence in the high-performance computing arena, corresponding to a trend of system architectures growing more heterogeneous. The current MPI standard only specifies the compatibility levels between MPI and threading runtimes; no MPI concept or interface exists for applications to pass thread context or GPU stream context to MPI implementations explicitly. This omission has made performance optimization complicated in some cases and impossible in others. We propose a new concept in MPI, called MPIX stream, to represent the general serial execution context that exists in X runtimes. MPIX streams can be directly mapped to threads or GPU execution streams. Passing thread context into MPI allows implementations to precisely map execution contexts to network endpoints. Passing GPU execution context into MPI allows implementations to operate directly on GPU streams, lowering the CPU/GPU synchronization cost.
{"title":"MPIX Stream: An Explicit Solution to Hybrid MPI+X Programming","authors":"Hui Zhou, Kenneth Raffenetti, Yan-Hua Guo, R. Thakur","doi":"10.1145/3555819.3555820","DOIUrl":"https://doi.org/10.1145/3555819.3555820","url":null,"abstract":"The hybrid MPI+X programming paradigm, where X refers to threads or GPUs, has gained prominence in the high-performance computing arena. This corresponds to a trend of system architectures growing more heterogeneous. The current MPI standard only specifies the compatibility levels between MPI and threading runtimes. No MPI concept or interface exists for applications to pass thread context or GPU stream context to MPI implementations explicitly. This lack has made performance optimization complicated in some cases and impossible in other cases. We propose a new concept in MPI, called MPIX stream, to represent the general serial execution context that exists in X runtimes. MPIX streams can be directly mapped to threads or GPU execution streams. Passing thread context into MPI allows implementations to precisely map the execution contexts to network endpoints. Passing GPU execution context into MPI allows implementations to directly operate on GPU streams, lowering the CPU/GPU synchronization cost.","PeriodicalId":423846,"journal":{"name":"Proceedings of the 29th European MPI Users' Group Meeting","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115994991","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Locality-Aware Bruck Allgather
Amanda Bienz, Shreemant Gautam, Amun Kharel
Collective algorithms are an essential part of MPI, allowing application programmers to utilize underlying optimizations of common distributed operations. MPI_Allgather gathers data that is initially distributed across all processes so that the full data set is available to every process. For small data sizes, the Bruck algorithm is commonly implemented to minimize the maximum number of messages communicated by any process. However, the cost of each communication step depends on the relative locations of the source and destination processes, with non-local messages, such as inter-node messages, significantly more costly than local messages, such as intra-node messages. This paper optimizes the Bruck algorithm with locality-awareness, minimizing the number and size of non-local messages to improve the performance and scalability of the allgather operation.
{"title":"A Locality-Aware Bruck Allgather","authors":"Amanda Bienz, Shreemant Gautam, Amun Kharel","doi":"10.1145/3555819.3555825","DOIUrl":"https://doi.org/10.1145/3555819.3555825","url":null,"abstract":"Collective algorithms are an essential part of MPI, allowing application programmers to utilize underlying optimizations of common distributed operations. The MPI_Allgather gathers data, which is originally distributed across all processes, so that all data is available to each process. For small data sizes, the Bruck algorithm is commonly implemented to minimize the maximum number of messages communicated by any process. However, the cost of each step of communication is dependent upon the relative locations of source and destination processes, with non-local messages, such as inter-node, significantly more costly than local messages, such as intra-node. This paper optimizes the Bruck algorithm with locality-awareness, minimizing the number and size of non-local messages to improve performance and scalability of the allgather operation.","PeriodicalId":423846,"journal":{"name":"Proceedings of the 29th European MPI Users' Group Meeting","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128131012","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Proceedings of the 29th European MPI Users' Group Meeting
J. Dongarra, Javier Garcia Blas, J. Carretero
EuroMPI is the successor to the EuroPVM/MPI user group meeting series (since 2010), making EuroMPI 2013 the 20th event of this kind. EuroMPI takes place each year at a different European location; the 2013 meeting was held in Madrid, Spain, organized by the Computer Architecture and Technology Group (ARCOS). Previous meetings were held in Vienna (2012), Santorini (2011), Stuttgart (2010), Espoo (2009), Dublin (2008), Paris (2007), Bonn (2006), Sorrento (2005), Budapest (2004), Venice (2003), Linz (2002), Santorini (2001), Balatonfüred (2000), Barcelona (1999), Liverpool (1998), Cracow (1997), Munich (1996), Lyon (1995), and Rome (1994). The meeting took place during September 15–18, 2013.
{"title":"Proceedings of the 29th European MPI Users' Group Meeting","authors":"J. Dongarra, Javier Garcia Blas, J. Carretero","doi":"10.1145/2488551","DOIUrl":"https://doi.org/10.1145/2488551","url":null,"abstract":"EuroMPI is the successor to the EuroPVM/MPI user group meeting series (since 2010), making EuroMPI 2013 the 20th event of this kind. EuroMPI takes place each year at a different European location; the 2013 meeting was held in Madrid, Spain, organized jointly by the Computer Architecture and Technology Group (ARCOS). Previous meetings were held in Vienna (2012), Santorini (2011), Stuttgart (2010), Espoo (2009), Dublin (2008), Paris (2007), Bonn (2006), Sorrento (2005), Budapest (2004), Venice (2003), Linz (2002), Santorini (2001), Balatonfured (2000), Barcelona (1999), Liverpool (1998), Cracow (1997), Munich (1996), Lyon (1995), and Rome (1994). The meeting took place during September 15--18, 2013.","PeriodicalId":423846,"journal":{"name":"Proceedings of the 29th European MPI Users' Group Meeting","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115030257","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}