Correctness Analysis of MPI-3 Non-Blocking Communications in PARCOACH
Julien Jaeger, Emmanuelle Saillard, Patrick Carribault, Denis Barthou
DOI: 10.1145/2802658.2802674

MPI-3 provides functions for non-blocking collectives. To help programmers introduce non-blocking collectives into existing MPI programs, we improve the PARCOACH tool for checking the correctness of MPI call sequences. These enhancements focus on correct call sequences for all flavors of collective calls, and on the presence of completion calls for all non-blocking communications. The evaluation shows an overhead under 10% of the original compilation time.

STCI: Scalable RunTime Component Infrastructure
Geoffroy R. Vallée, D. Bernholdt, S. Böhm, T. Naughton
Oak Ridge National Laboratory, Oak Ridge, Tennessee, USA
DOI: 10.1145/2802658.2802675

Plan B: Interruption of Ongoing MPI Operations to Support Failure Recovery
Aurélien Bouteiller, G. Bosilca, J. Dongarra
DOI: 10.1145/2802658.2802668

Advanced failure recovery strategies in HPC systems benefit tremendously from in-place failure recovery, in which the MPI infrastructure can survive process crashes and resume communication services. In this paper we present the rationale behind the specification, and an effective implementation, of the Revoke MPI operation. The purpose of the Revoke operation is the propagation of failure knowledge, and the interruption of ongoing, pending communication, under the control of the user. We explain how the Revoke operation can be implemented with a reliable broadcast over the scalable and failure-resilient Binomial Graph (BMG) overlay network. Evaluation at scale, on a Cray XC30 supercomputer, demonstrates that the Revoke operation has a small latency, and does not introduce system noise outside of failure recovery periods.

Specification Guideline Violations by MPI_Dims_create
J. Träff, F. Lübbe
DOI: 10.1145/2802658.2802677

In benchmarking a library providing alternative functionality for structured, so-called isomorphic, sparse collective communication [4], we found use for the MPI_Dims_create functionality of MPI [3] for suggesting a balanced factorization of a given number p (of MPI processes) into d factors that can be used as the dimension sizes in a d-dimensional Cartesian communicator. Much to our surprise, we observed that a) different MPI libraries can differ quite significantly in the factorization they suggest, and b) the produced factorizations can sometimes be quite far from balanced; indeed, for some composite numbers p some MPI libraries return trivial factorizations (with p itself as a factor). This renders the functionality, as implemented, useless. In this poster abstract, we elaborate on these findings.

Isomorphic, Sparse MPI-like Collective Communication Operations for Parallel Stencil Computations
J. Träff, F. Lübbe, Antoine Rougier, S. Hunold
DOI: 10.1145/2802658.2802663

We propose a specification and discuss implementations of collective operations for parallel stencil-like computations that are not supported well by the current MPI 3.1 neighborhood collectives. In our isomorphic, sparse collectives all processes partaking in the communication operation use similar neighborhoods of processes with which to exchange data. Our interface assumes the p processes to be arranged in a d-dimensional torus (mesh) over which neighborhoods are specified per process by identical lists of relative coordinates. This significantly extends the functionality available for Cartesian communicators, and is a much lighter mechanism than distributed graph topologies. It allows for fast, local computation of communication schedules, and can be used in more dynamic contexts than current MPI functionality. We sketch three algorithms for neighborhoods with s source and target neighbors, namely a) a direct algorithm taking s communication rounds, b) a message-combining algorithm that communicates only along torus coordinates, and c) a message-combining algorithm using between ⌈log s⌉ and ⌈log p⌉ communication rounds. Our concrete interface has been implemented using the direct algorithm a). We benchmark our implementations and compare to the MPI neighborhood collectives. We demonstrate significant advantages in set-up times, and comparable communication times. Finally, we use our isomorphic, sparse collectives to implement a stencil computation with a deep halo, and discuss derived datatypes required for this application.

Efficient, Optimal MPI Datatype Reconstruction for Vector and Index Types
Martin Kalany, J. Träff
DOI: 10.1145/2802658.2802671

Type reconstruction is the process of finding a representation of a data layout as an MPI derived datatype that is efficient in terms of space and processing time. Practically efficient type reconstruction and normalization is important for high-quality MPI implementations that strive to provide good performance for communication operations involving noncontiguous data. Although it has recently been shown that the general problem of computing optimal tree representations of derived datatypes, allowing any of the MPI derived-datatype constructors, can be solved in polynomial time, the algorithm for this may unfortunately be impractical for datatypes with large counts. By restricting the allowed constructors to vector and index-block type constructors, but excluding the most general MPI_Type_create_struct constructor, the problem can be solved much more efficiently. More precisely, we give a new O(n log n / log log n) time algorithm for finding cost-optimal representations of MPI type maps of length n using only vector and index-block constructors, for a simple but flexible, additive cost model. This improves significantly over a previous O(n√n) time algorithm for the same problem, and the algorithm is simple enough to be considered for practical MPI libraries.

MPI Advisor: a Minimal Overhead Tool for MPI Library Performance Tuning
E. Gallardo, Jérôme Vienne, L. Fialho, P. Teller, J. Browne
DOI: 10.1145/2802658.2802667

A majority of parallel applications executed on HPC clusters use MPI for communication between processes. Most users treat MPI as a black box, executing their programs using the cluster's default settings. While the default settings perform adequately in many cases, it is well known that optimizing the MPI environment can significantly improve application performance. Although the existing optimization tools are effective when used by performance experts, they require deep knowledge of MPI library behavior and of the underlying hardware architecture on which the application will be executed. Therefore, an easy-to-use tool that provides recommendations for configuring the MPI environment to optimize application performance is highly desirable. This paper addresses this need by presenting an easy-to-use methodology and tool, named MPI Advisor, that requires just a single execution of the input application to characterize its predominant communication behavior and determine the MPI configuration that may enhance its performance on the target combination of MPI library and hardware architecture. Currently, MPI Advisor provides recommendations that address the four most commonly occurring MPI-related performance bottlenecks, which are related to the choice of: 1) point-to-point protocol (eager vs. rendezvous), 2) collective communication algorithm, 3) MPI tasks-to-cores mapping, and 4) InfiniBand transport protocol. The performance gains obtained by implementing the recommended optimizations in the case studies presented in this paper range from a few percent to more than 40%. Specifically, using this tool, we were able to improve the performance of HPCG with MVAPICH2 on four nodes of the Stampede cluster from 6.9 GFLOP/s to 10.1 GFLOP/s. Since the tool provides application-specific recommendations, it also informs the user about correct usage of MPI.

DAME: A Runtime-Compiled Engine for Derived Datatypes
Tarun Prabhu, W. Gropp
DOI: 10.1145/2802658.2802659

In order to achieve high performance on modern and future machines, applications need to make effective use of the complex, hierarchical memory system. Writing performance-portable code continues to be challenging since each architecture has unique memory access characteristics. In addition, some optimization decisions can only reasonably be made at runtime. This suggests that a two-pronged approach is required: first, provide the programmer with a means to express memory operations declaratively, so that a runtime system can transparently access the memory in the best way; second, exploit runtime information. MPI's derived datatypes accomplish the former, although their performance in current MPI implementations shows scope for improvement. JIT-compilation can be used for the latter. In this work, we present DAME, a language and interpreter used as the backend for MPI's derived datatypes. We also present DAME-L and DAME-X, two JIT-enabled implementations of DAME. All three implementations have been integrated into MPICH. We evaluate the performance of our implementations using DDTBench and two mini-applications written with MPI derived datatypes, and obtain communication speedups of up to 20x and a mini-application speedup of 3x.

Performance Evaluation of OpenFOAM* with MPI-3 RMA Routines on Intel® Xeon® Processors and Intel® Xeon Phi™ Coprocessors
Nishant Agrawal, Paul Edwards, Ambuj Pandey, Michael Klemm, Ravi Ojha, R. A. Razak
DOI: 10.1145/2802658.2802676

OpenFOAM is a software package for solving partial differential equations and is very popular for computational fluid dynamics in the automotive segment. In this work, we describe our evaluation of the performance of OpenFOAM with MPI-3 Remote Memory Access (RMA) one-sided communication on the Intel® Xeon Phi™ coprocessor. Currently, OpenFOAM computes on a mesh that is decomposed among different MPI ranks, and it requires a large amount of communication between neighboring ranks. MPI-3 offers RMA through a new API that decouples communication and synchronization. The aim is to achieve better performance with MPI-3 RMA routines as compared to the current two-sided asynchronous communication routines in OpenFOAM. We also describe the challenges overcome in order to integrate the different MPI-3 RMA routines into OpenFOAM. This discussion aims at analyzing the potential of MPI-3 RMA in OpenFOAM and benchmarking the performance on both the processor and the coprocessor. Our work also demonstrates that MPI-3 RMA in OpenFOAM can run in symmetric mode on a system consisting of the Intel® Xeon® E5-2697 v3 processor and the Intel® Xeon Phi™ 7120P coprocessor.

On the Impact of Synchronizing Clocks and Processes on Benchmarking MPI Collectives
S. Hunold, Alexandra Carpen-Amarie
DOI: 10.1145/2802658.2802662

We consider the problem of accurately measuring the time to complete an MPI collective operation, as the result strongly depends on how the time is measured. Our goal is to develop an experimental method that allows for reproducible measurements of MPI collectives. When executing large parallel codes, MPI processes are often skewed in time when entering a collective operation. However, to obtain reproducible measurements, it is a common approach to synchronize all processes before they call the MPI collective operation. We therefore take a closer look at two commonly used process synchronization schemes: (1) relying on MPI_Barrier or (2) applying a window-based scheme using a common global time. We analyze both schemes experimentally and show the strengths and weaknesses of each approach. As window-based schemes require the notion of global time, we thoroughly evaluate different clock synchronization algorithms in various experiments. We also propose a novel clock synchronization algorithm that combines two advantages of known algorithms, which are (1) taking the inherent clock drift into account and (2) using a tree-based synchronization scheme to reduce the synchronization duration.