Efficient Process Arrival Pattern Aware Collective Communication for Deep Learning
Pedram Alizadeh, A. Sojoodi, Yiltan Hassan Temuçin, A. Afsahi
MPI collective communication operations are used extensively in parallel applications. As such, researchers have been investigating how to improve their performance and scalability to directly impact application performance. Unfortunately, most of these studies are based on the premise that all processes arrive at the collective call simultaneously. A few studies, though, have shown that an imbalanced Process Arrival Pattern (PAP) is ubiquitous in real environments and significantly affects collective performance. Therefore, devising PAP-aware collective algorithms that could improve performance, while challenging, is highly desirable. This paper follows that line of work, but in the context of Deep Learning (DL) workloads, which have become mainstream. It first presents a brief characterization of collective communications, in particular MPI_Allreduce, in the Horovod distributed Deep Learning framework and shows that the arrival pattern of MPI processes is indeed imbalanced. It then proposes an intra-node shared-memory PAP-aware MPI_Allreduce algorithm for small to medium messages, in which the leader process is dynamically chosen based on the arrival time of the processes at each invocation of the collective call. We then propose an intra-node PAP-aware algorithm for large messages that dynamically constructs the reduction schedule at each MPI_Allreduce invocation. Finally, we propose a PAP-aware cluster-wide hierarchical algorithm that builds on our intra-node PAP-aware designs and, given its hierarchical nature, imposes less data dependency among processes than flat algorithms. The proposed algorithms deliver up to 58% improvement over the native algorithms at the micro-benchmark level and up to 17% improvement for Horovod with TensorFlow.
{"title":"Efficient Process Arrival Pattern Aware Collective Communication for Deep Learning","authors":"Pedram Alizadeh, A. Sojoodi, Yiltan Hassan Temuçin, A. Afsahi","doi":"10.1145/3555819.3555857","DOIUrl":"https://doi.org/10.1145/3555819.3555857","url":null,"abstract":"MPI collective communication operations are used extensively in parallel applications. As such, researchers have been investigating how to improve their performance and scalability to directly impact application performance. Unfortunately, most of these studies are based on the premise that all processes arrive at the collective call simultaneously. A few studies though have shown that imbalanced Process Arrival Pattern (PAP) is ubiquitous in real environments, significantly affecting the collective performance. Therefore, devising PAP-aware collective algorithms that could improve performance, while challenging, is highly desirable. This paper is along those lines but in the context of Deep Learning (DL) workloads that have become maintstream. This paper presents a brief characterization of collective communications, in particular MPI_Allreduce, in the Horovod distributed Deep Learning framework and shows that the arrival pattern of MPI processes is indeed imbalanced. It then proposes an intra-node shared-memory PAP-aware MPI_Allreduce algorithm for small to medium messages, where the leader process is dynamically chosen based on the arrival time of the processes at each invocation of the collective call. We then propose an intra-node PAP-aware algorithm for large messages that dynamically constructs the reduction schedule at each MPI_Allreduce invocation. Finally, we propose a PAP-aware cluster-wide hierarchical algorithm, which is extended by utilizing our intra-node PAP-aware designs, that imposes less data dependency among processes given its hierarchical nature compared to flat algorithms. The proposed algorithms deliver up to 58% and 17% improvement at the micro-benchmark and Horovod with TensorFlow application over the native algorithms, respectively.","PeriodicalId":423846,"journal":{"name":"Proceedings of the 29th European MPI Users' Group Meeting","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125396281","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Applying on Node Aggregation Methods to MPI Alltoall Collectives: Matrix Block Aggregation Algorithm
G. Chochia, David G. Solt, Joshua Hursey
This paper presents algorithms for the all-to-all and all-to-all(v) MPI collectives optimized for small to medium messages and large per-node task counts, targeting multicore CPUs in HPC systems. The complexity of these algorithms is analyzed for two metrics: the number of messages and the volume of data exchanged per task. The algorithms achieve optimal complexity for the second metric, improving on algorithms designed for short messages by a logarithmic factor, while retaining logarithmic complexity for the first metric. It is shown that the balance between these two metrics is key to achieving optimal performance. The performance advantage of the new algorithm is demonstrated at scale by comparing against logarithmic algorithm implementations in Open MPI and Spectrum MPI. A two-phase design for the all-to-all(v) algorithm is presented; it combines efficient handling of short and large messages in a single framework, which is a known difficulty for logarithmic all-to-all(v) algorithms.
{"title":"Applying on Node Aggregation Methods to MPI Alltoall Collectives: Matrix Block Aggregation Algorithm","authors":"G. Chochia, David G. Solt, Joshua Hursey","doi":"10.1145/3555819.3555821","DOIUrl":"https://doi.org/10.1145/3555819.3555821","url":null,"abstract":"This paper presents algorithms for all-to-all and all-to-all(v) MPI collectives optimized for small-medium messages and large task counts per node to support multicore CPUs in HPC systems. The complexity of these algorithms is analyzed for two metrics: the number of messages and the volume of data exchanged per task. These algorithms have optimal complexity for the second metric, which is better by a logarithmic factor than that in algorithms designed for short messages, with logarithmic complexity for the first metric. It is shown that the balance between these two metrics is key to achieving optimal performance. The performance advantage of the new algorithm is demonstrated at scale by comparing performance versus logarithmic algorithm implementations in Open MPI and Spectrum MPI. The two-phase design for the all-to-all(v) algorithm is presented. It combines efficient implementations for short and large messages in a single framework which is known to be an issue in logarithmic all-to-all(v) algorithms.","PeriodicalId":423846,"journal":{"name":"Proceedings of the 29th European MPI Users' Group Meeting","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115777834","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Enabling Global MPI Process Addressing in MPI Applications
Jean-Baptiste Besnard, S. Shende, A. Malony, Julien Jaeger, Marc Pérache
Distributed software using MPI is now facing a complexity barrier. Indeed, given increasing intra-node parallelism combined with the use of accelerators, programs’ states are becoming more intricate. A given code must cover several cases, generating work for multiple devices. Model mixing generally leads to increasingly large programs and hinders performance portability. In this paper, we pose the question of software composition, splitting jobs into multiple services. In doing so, we argue that it would become possible to rely on more suitable units while removing the need for extensive runtime stacking (MPI+X+Y). For this purpose, we discuss what MPI should provide, and what is currently available, to enable such software composition. After pinpointing (1) process discovery and (2) Remote Procedure Calls (RPCs) as facilitators in such an infrastructure, we focus solely on the first aspect. We introduce an overlay network providing whole-machine inter-job discovery and wiring at the level of the MPI runtime. MPI process Unique IDentifiers (UIDs) are then treated as Unique Resource Locators (URLs) that support job interaction in MPI, enabling a more horizontal use of the MPI interface. Finally, we present performance results for large-scale wiring-up exchanges, demonstrating gains over PMIx in cross-job configurations.
{"title":"Enabling Global MPI Process Addressing in MPI Applications","authors":"Jean-Baptiste Besnard, S. Shende, A. Malony, Julien Jaeger, Marc Pérache","doi":"10.1145/3555819.3555829","DOIUrl":"https://doi.org/10.1145/3555819.3555829","url":null,"abstract":"Distributed software using MPI is now facing a complexity barrier. Indeed, given increasing intra-node parallelism, combined with the use of accelerator, programs’ states are becoming more intricate. A given code must cover several cases, generating work for multiple devices. Model mixing generally leads to increasingly large programs and hinders performance portability. In this paper, we pose the question of software composition, trying to split jobs in multiple services. In doing so, we advocate it would be possible to depend on more suitable units while removing the need for extensive runtime stacking (MPI+X+Y). For this purpose, we discuss what MPI shall provide and what is currently available to enable such software composition. After pinpointing (1) process discovery and (2) Remote Procedure Calls (RPCs) as facilitators in such infrastructure, we focus solely on the first aspect. We introduce an overlay-network providing whole-machine inter-job, discovery, and wiring at the level of the MPI runtime. MPI process Unique IDentifiers (UIDs) are then covered as a Unique Resource Locator (URL) leveraged as support for job interaction in MPI, enabling a more horizontal usage of the MPI interface. Eventually, we present performance results for large-scale wiring-up exchanges, demonstrating gains over PMIx in cross-job configurations.","PeriodicalId":423846,"journal":{"name":"Proceedings of the 29th European MPI Users' Group Meeting","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128646856","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Towards Dynamic Resource Management with MPI Sessions and PMIx
Dominik Huber, Maximilian Streubel, Isaías Comprés, M. Schulz, Martin Schreiber, H. Pritchard
Job management software on peta- and exascale supercomputers continues to provide static resource allocations, from a program’s start until its end. Dynamic resource allocation and management is a research direction with the potential to improve the efficiency of HPC systems and applications by adapting an application’s resources during its runtime. Resources can be adapted based on past, current, or even future system conditions and matching optimization targets. However, implementing dynamic resource management is challenging, as it requires support across many layers of the software stack, including the programming model. In this paper, we focus on the latter and present our approach to extending MPI Sessions to support dynamic resource allocations within MPI applications. While some forms of dynamicity already exist in MPI, they are currently limited by requiring global synchronization, by being application or application-domain specific, or by the limited support in current HPC system software stacks. We overcome these limitations with a simple yet powerful abstraction: resources as process sets, and changes of resources as set operations, leading to a graph-based perspective on resource changes. As the main contribution of this work, we provide an implementation of this approach based on MPI Sessions and PMIx. In addition, we illustrate its usage and discuss the required extensions to the PMIx standard. We report results based on a prototype implementation with Open MPI using a synthetic application, as well as a PDE solver benchmark, on up to four nodes and a total of 112 cores. Overall, our results show the feasibility of our approach, which incurs only very moderate overheads. We see this first proof of concept as an important step towards resource adaptivity based on MPI Sessions.
{"title":"Towards Dynamic Resource Management with MPI Sessions and PMIx","authors":"Dominik Huber, Maximilian Streubel, Isaías Comprés, M. Schulz, Martin Schreiber, H. Pritchard","doi":"10.1145/3555819.3555856","DOIUrl":"https://doi.org/10.1145/3555819.3555856","url":null,"abstract":"Job management software on peta- and exascale supercomputers continues to provide static resource allocations, from a program’s start until its end. Dynamic resource allocation and management is a research direction that has the potential to improve the efficiency of HPC systems and applications by dynamically adapting the resources of an application during its runtime. Resources can be adapted based on past, current or even future system conditions and matching optimization targets. However, the implementation of dynamic resource management is challenging as it requires support across many layers of the software stack, including the programming model. In this paper, we focus on the latter and present our approach to extend MPI Sessions to support dynamic resource allocations within MPI applications. While some forms of dynamicity already exist in MPI, it is currently limited by requiring global synchronization, being application or application-domain specific, or by suffering from limited support in current HPC system software stacks. We overcome these limitations with a simple, yet powerful abstraction: resources as process sets, and changes of resources as set operations leading to a graph-based perspective on resource changes. As the main contribution of this work, we provide an implementation of this approach based on MPI Sessions and PMIx. In addition, an illustration of its usage is provided, as well as a discussion about the required extensions of the PMIx standard. We report results based on a prototype implementation with Open MPI using a synthetic application, as well as a PDE solver benchmark on up to four nodes and a total of 112 cores. Overall, our results show the feasibility of our approach, which has only very moderate overheads. We see this first proof-of-concept as an important step towards resource adaptivity based on MPI Sessions.","PeriodicalId":423846,"journal":{"name":"Proceedings of the 29th European MPI Users' Group Meeting","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132222497","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Distributed Acceleration of Adhesive Dynamics Simulations
Daniel F. Puleri, Aristotle X. Martin, A. Randles
Cell adhesion plays a critical role in processes ranging from leukocyte migration to cancer cell transport during metastasis. Adhesive cell interactions can occur over large distances in microvessel networks, with cells traveling distances much greater than their own diameter. Biologically relevant investigations therefore require efficient modeling of large field-of-view domains, but current models are limited because simulating such geometries at the sub-micron scale required to resolve adhesive interactions greatly increases the computational requirements, even for small domain sizes. In this study, we introduce a hybrid scheme that relies on both on-node and distributed parallelism to accelerate a fully deformable adhesive dynamics cell model. This scheme makes performant use of modern supercomputers with many-core per-node architectures. On-node acceleration is augmented by a combination of spatial data structures and algorithmic changes that lessen the need for atomic operations. The resulting deformable adhesive cell model, accelerated with hybrid parallelization, allows us to bridge the gap between high-resolution cell models, which can capture the sub-micron adhesive interactions between the cell and its microenvironment, and large-scale fluid-structure interaction (FSI) models, which can track cells over considerable distances. By integrating the sub-micron simulation environment into a distributed FSI simulation, we enable the study of previously infeasible research questions involving numerous adhesive cells in microvessel networks, such as cancer cell transport through the microcirculation.
{"title":"Distributed Acceleration of Adhesive Dynamics Simulations","authors":"Daniel F. Puleri, Aristotle X. Martin, A. Randles","doi":"10.1145/3555819.3555832","DOIUrl":"https://doi.org/10.1145/3555819.3555832","url":null,"abstract":"Cell adhesion plays a critical role in processes ranging from leukocyte migration to cancer cell transport during metastasis. Adhesive cell interactions can occur over large distances in microvessel networks with cells traveling over distances much greater than the length scale of their own diameter. Therefore, biologically relevant investigations necessitate efficient modeling of large field-of-view domains, but current models are limited by simulating such geometries at the sub-micron scale required to model adhesive interactions which greatly increases the computational requirements for even small domain sizes. In this study we introduce a hybrid scheme reliant on both on-node and distributed parallelism to accelerate a fully deformable adhesive dynamics cell model. This scheme leads to performant system usage of modern supercomputers which use a many-core per-node architecture. On-node acceleration is augmented by a combination of spatial data structures and algorithmic changes to lessen the need for atomic operations. This deformable adhesive cell model accelerated with hybrid parallelization allows us to bridge the gap between high-resolution cell models which can capture the sub-micron adhesive interactions between the cell and its microenvironment, and large-scale fluid-structure interaction (FSI) models which can track cells over considerable distances. By integrating the sub-micron simulation environment into a distributed FSI simulation we enable the study of previously unfeasible research questions involving numerous adhesive cells in microvessel networks such as cancer cell transport through the microcirculation.","PeriodicalId":423846,"journal":{"name":"Proceedings of the 29th European MPI Users' Group Meeting","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114341112","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Towards a Hybrid MPI Correctness Benchmark Suite
Tim Jammer, Alexander Hück, Jan-Patrick Lehr, Joachim Protze, Simon Schwitanski, Christian H. Bischof
High-performance computing codes often combine the Message-Passing Interface (MPI) with a shared-memory programming model, e.g., OpenMP, for efficient computations. These so-called hybrid models may issue MPI calls concurrently from different threads at the highest level of MPI thread support. The correct use of either MPI or OpenMP alone can be complex and error-prone; the hybrid model increases this complexity even further. While correctness analysis tools exist for both programming paradigms, hybrid models introduce a new set of potential errors whose detection requires combining knowledge of MPI and OpenMP primitives. Unfortunately, correctness tools do not fully support the hybrid model yet, and their current capabilities are also hard to assess. In previous work, to enable structured comparisons of correctness tools and improve their coverage, we proposed the MPI-CorrBench test suite for MPI. Likewise, others proposed the DataRaceBench test suite for OpenMP. However, no such test suite exists for the particular error classes of the hybrid model. Hence, we propose a hybrid MPI-OpenMP test suite to (1) facilitate correctness tool development in this area and, subsequently, (2) further encourage the use of the hybrid model at the highest level of MPI thread support. To that end, we discuss issues with this hybrid model and the MPI and OpenMP knowledge that correctness tools need to combine in order to detect them. In our evaluation of two state-of-the-art correctness tools, we see that for most cases of concurrent and conflicting MPI operations, these tools can cope with the added complexity of OpenMP. However, more intricate errors, where user code interferes with MPI, e.g., a data race on a buffer, still evade tool analysis.
{"title":"Towards a Hybrid MPI Correctness Benchmark Suite","authors":"Tim Jammer, Alexander Hück, Jan-Patrick Lehr, Joachim Protze, Simon Schwitanski, Christian H. Bischof","doi":"10.1145/3555819.3555853","DOIUrl":"https://doi.org/10.1145/3555819.3555853","url":null,"abstract":"High-performance computing codes often combine the Message-Passing Interface (MPI) with a shared-memory programming model, e.g., OpenMP, for efficient computations. These so-called hybrid models may issue MPI calls concurrently from different threads at the highest level of MPI thread support. The correct use of either MPI or OpenMP can be complex and error-prone. The hybrid model increases this complexity even further. While correctness analysis tools exist for both programming paradigms, for hybrid models, a new set of potential errors exist, whose detection requires combining knowledge of MPI and OpenMP primitives. Unfortunately, correctness tools do not fully support the hybrid model yet, and their current capabilities are also hard to assess. In previous work, to enable structured comparisons of correctness tools and improve their coverage, we proposed the MPI-CorrBench test suite for MPI. Likewise, others proposed the DataRaceBench test suite for OpenMP. However, for the particular error classes of the hybrid model, no such test suite exists. Hence, we propose a hybrid MPI-OpenMP test suite to (1) facilitate the correctness tool development in this area and, subsequently, (2) further encourage the use of the hybrid model at the highest level of MPI thread support. To that end, we discuss issues with this hybrid model and the knowledge correctness tools need to combine w.r.t. MPI and OpenMP to detect these. In our evaluation of two state-of-the-art correctness tools, we see that for most cases of concurrent and conflicting MPI operations, these tools can cope with the added complexity of OpenMP. However, more intricate errors, where user code interferes with MPI, e.g., a data race on a buffer, still evade tool analysis.","PeriodicalId":423846,"journal":{"name":"Proceedings of the 29th European MPI Users' Group Meeting","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130723441","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MPIX Stream: An Explicit Solution to Hybrid MPI+X Programming
Hui Zhou, Kenneth Raffenetti, Yan-Hua Guo, R. Thakur
The hybrid MPI+X programming paradigm, where X refers to threads or GPUs, has gained prominence in the high-performance computing arena, corresponding to a trend of system architectures growing more heterogeneous. The current MPI standard only specifies the compatibility levels between MPI and threading runtimes; no MPI concept or interface exists for applications to pass thread context or GPU stream context to MPI implementations explicitly. This omission has made performance optimization complicated in some cases and impossible in others. We propose a new concept in MPI, called MPIX stream, to represent the general serial execution context that exists in X runtimes. MPIX streams can be directly mapped to threads or GPU execution streams. Passing thread context into MPI allows implementations to precisely map execution contexts to network endpoints. Passing GPU execution context into MPI allows implementations to operate directly on GPU streams, lowering the CPU/GPU synchronization cost.
{"title":"MPIX Stream: An Explicit Solution to Hybrid MPI+X Programming","authors":"Hui Zhou, Kenneth Raffenetti, Yan-Hua Guo, R. Thakur","doi":"10.1145/3555819.3555820","DOIUrl":"https://doi.org/10.1145/3555819.3555820","url":null,"abstract":"The hybrid MPI+X programming paradigm, where X refers to threads or GPUs, has gained prominence in the high-performance computing arena. This corresponds to a trend of system architectures growing more heterogeneous. The current MPI standard only specifies the compatibility levels between MPI and threading runtimes. No MPI concept or interface exists for applications to pass thread context or GPU stream context to MPI implementations explicitly. This lack has made performance optimization complicated in some cases and impossible in other cases. We propose a new concept in MPI, called MPIX stream, to represent the general serial execution context that exists in X runtimes. MPIX streams can be directly mapped to threads or GPU execution streams. Passing thread context into MPI allows implementations to precisely map the execution contexts to network endpoints. Passing GPU execution context into MPI allows implementations to directly operate on GPU streams, lowering the CPU/GPU synchronization cost.","PeriodicalId":423846,"journal":{"name":"Proceedings of the 29th European MPI Users' Group Meeting","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115994991","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Locality-Aware Bruck Allgather
Amanda Bienz, Shreemant Gautam, Amun Kharel
Collective algorithms are an essential part of MPI, allowing application programmers to utilize underlying optimizations of common distributed operations. MPI_Allgather gathers data that is initially distributed across all processes so that the full data set is available to every process. For small data sizes, the Bruck algorithm is commonly implemented to minimize the maximum number of messages communicated by any process. However, the cost of each communication step depends on the relative locations of the source and destination processes, with non-local messages, such as inter-node messages, significantly more costly than local messages, such as intra-node messages. This paper optimizes the Bruck algorithm with locality-awareness, minimizing the number and size of non-local messages to improve the performance and scalability of the allgather operation.
{"title":"A Locality-Aware Bruck Allgather","authors":"Amanda Bienz, Shreemant Gautam, Amun Kharel","doi":"10.1145/3555819.3555825","DOIUrl":"https://doi.org/10.1145/3555819.3555825","url":null,"abstract":"Collective algorithms are an essential part of MPI, allowing application programmers to utilize underlying optimizations of common distributed operations. The MPI_Allgather gathers data, which is originally distributed across all processes, so that all data is available to each process. For small data sizes, the Bruck algorithm is commonly implemented to minimize the maximum number of messages communicated by any process. However, the cost of each step of communication is dependent upon the relative locations of source and destination processes, with non-local messages, such as inter-node, significantly more costly than local messages, such as intra-node. This paper optimizes the Bruck algorithm with locality-awareness, minimizing the number and size of non-local messages to improve performance and scalability of the allgather operation.","PeriodicalId":423846,"journal":{"name":"Proceedings of the 29th European MPI Users' Group Meeting","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128131012","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Proceedings of the 29th European MPI Users' Group Meeting
J. Dongarra, Javier Garcia Blas, J. Carretero
EuroMPI is the successor to the EuroPVM/MPI user group meeting series (since 2010), making EuroMPI 2013 the 20th event of this kind. EuroMPI takes place each year at a different European location; the 2013 meeting was held in Madrid, Spain, organized by the Computer Architecture and Technology Group (ARCOS). Previous meetings were held in Vienna (2012), Santorini (2011), Stuttgart (2010), Espoo (2009), Dublin (2008), Paris (2007), Bonn (2006), Sorrento (2005), Budapest (2004), Venice (2003), Linz (2002), Santorini (2001), Balatonfüred (2000), Barcelona (1999), Liverpool (1998), Cracow (1997), Munich (1996), Lyon (1995), and Rome (1994). The meeting took place during September 15–18, 2013.
{"title":"Proceedings of the 29th European MPI Users' Group Meeting","authors":"J. Dongarra, Javier Garcia Blas, J. Carretero","doi":"10.1145/2488551","DOIUrl":"https://doi.org/10.1145/2488551","url":null,"abstract":"EuroMPI is the successor to the EuroPVM/MPI user group meeting series (since 2010), making EuroMPI 2013 the 20th event of this kind. EuroMPI takes place each year at a different European location; the 2013 meeting was held in Madrid, Spain, organized jointly by the Computer Architecture and Technology Group (ARCOS). Previous meetings were held in Vienna (2012), Santorini (2011), Stuttgart (2010), Espoo (2009), Dublin (2008), Paris (2007), Bonn (2006), Sorrento (2005), Budapest (2004), Venice (2003), Linz (2002), Santorini (2001), Balatonfured (2000), Barcelona (1999), Liverpool (1998), Cracow (1997), Munich (1996), Lyon (1995), and Rome (1994). The meeting took place during September 15--18, 2013.","PeriodicalId":423846,"journal":{"name":"Proceedings of the 29th European MPI Users' Group Meeting","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115030257","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}