
Proceedings of the 22nd European MPI Users' Group Meeting: Latest Publications

Correctness Analysis of MPI-3 Non-Blocking Communications in PARCOACH
Pub Date : 2015-09-21 DOI: 10.1145/2802658.2802674
Julien Jaeger, Emmanuelle Saillard, Patrick Carribault, Denis Barthou
MPI-3 provides functions for non-blocking collectives. To help programmers introduce non-blocking collectives into existing MPI programs, we improve the PARCOACH tool for checking the correctness of MPI call sequences. These enhancements focus on correct call sequences for all flavors of collective calls, and on the presence of completion calls for all non-blocking communications. The evaluation shows an overhead of less than 10% of the original compilation time.
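As an illustration of the call sequences such a tool has to verify (our own sketch, not code from the paper), the following minimal C/MPI program shows a non-blocking collective that is correct only because every MPI_Ibcast is eventually matched by a completion call; omitting the MPI_Wait, or letting processes disagree on the order of collectives, is the kind of error PARCOACH is designed to flag.

```c
/* Minimal sketch (not from the paper): a non-blocking broadcast that is
 * correct only because the request is completed on every process. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, data = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        data = 42;

    MPI_Request req;
    /* MPI-3 non-blocking collective: must be called by all processes of the
     * communicator, in the same order as the other collectives. */
    MPI_Ibcast(&data, 1, MPI_INT, 0, MPI_COMM_WORLD, &req);

    /* ... independent computation can overlap with the broadcast ... */

    /* The completion call the analysis checks for: without it the program
     * is erroneous, even if it appears to run. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    printf("rank %d received %d\n", rank, data);
    MPI_Finalize();
    return 0;
}
```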
Citations: 10
STCI: Scalable RunTime Component Infrastructure
Pub Date : 2015-09-21 DOI: 10.1145/2802658.2802675
Geoffroy R. Vallée, D. Bernholdt, S. Böhm, T. Naughton
Geoffroy Vallée (valleegr@ornl.gov), David Bernholdt (bernholdtde@ornl.gov), Swen Böhm (bohms@ornl.gov), and Thomas Naughton (naughtont@ornl.gov), Oak Ridge National Laboratory, 1 Bethel Valley Road, Oak Ridge, Tennessee, USA
{"title":"STCI: Scalable RunTime Component Infrastructure","authors":"Geoffroy R. Vallée, D. Bernholdt, S. Böhm, T. Naughton","doi":"10.1145/2802658.2802675","DOIUrl":"https://doi.org/10.1145/2802658.2802675","url":null,"abstract":"Geoffroy Vallee Oak Ridge National Laboratory 1 Bethel Valley Road Oak Ridge, Tennessee, USA valleegr@ornl.gov David Bernholdt Oak Ridge National Laboratory 1 Bethel Valley Road Oak Ridge, Tennessee, USA bernholdtde@ornl.gov Swen Bohm Oak Ridge National Laboratory 1 Bethel Valley Road Oak Ridge, Tennessee, USA bohms@ornl.gov Thomas Naughton Oak Ridge National Laboratory 1 Bethel Valley Road Oak Ridge, Tennessee, USA naughtont@ornl.gov","PeriodicalId":365272,"journal":{"name":"Proceedings of the 22nd European MPI Users' Group Meeting","volume":"138 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123458201","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Plan B: Interruption of Ongoing MPI Operations to Support Failure Recovery
Pub Date : 2015-09-21 DOI: 10.1145/2802658.2802668
Aurélien Bouteiller, G. Bosilca, J. Dongarra
Advanced failure recovery strategies in HPC systems benefit tremendously from in-place failure recovery, in which the MPI infrastructure can survive process crashes and resume communication services. In this paper we present the rationale behind the specification of the Revoke MPI operation and an effective implementation of it. The purpose of the Revoke operation is the propagation of failure knowledge and the interruption of ongoing, pending communication, under the control of the user. We explain how the Revoke operation can be implemented with a reliable broadcast over the scalable and failure-resilient Binomial Graph (BMG) overlay network. Evaluation at scale, on a Cray XC30 supercomputer, demonstrates that the Revoke operation has a small latency and does not introduce system noise outside of failure recovery periods.
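For readers unfamiliar with the interface, the sketch below is an assumption based on the ULFM fault-tolerance extensions (MPIX_*) in which Revoke is specified, not code from the paper: a process that detects a failure revokes the communicator, which interrupts pending communication on all other processes so that everyone can jointly enter recovery.

```c
/* Sketch of the revoke/recover pattern, assuming the ULFM prototype
 * extensions (MPIX_*); not code taken from the paper. */
#include <mpi.h>
#include <mpi-ext.h>   /* ULFM extensions: MPIX_Comm_revoke, MPIX_Comm_shrink */

static void recover(MPI_Comm *comm)
{
    MPI_Comm shrunk;
    MPIX_Comm_shrink(*comm, &shrunk);   /* collective over the survivors */
    MPI_Comm_free(comm);
    *comm = shrunk;
}

void exchange(MPI_Comm *comm, int *buf, int count, int partner)
{
    MPI_Comm_set_errhandler(*comm, MPI_ERRORS_RETURN);

    int err = MPI_Sendrecv_replace(buf, count, MPI_INT, partner, 0,
                                   partner, 0, *comm, MPI_STATUS_IGNORE);
    if (err != MPI_SUCCESS) {
        int eclass;
        MPI_Error_class(err, &eclass);
        if (eclass == MPIX_ERR_PROC_FAILED) {
            /* Propagate the failure knowledge: interrupts matching and
             * pending operations on all processes of the communicator. */
            MPIX_Comm_revoke(*comm);
        }
        if (eclass == MPIX_ERR_PROC_FAILED || eclass == MPIX_ERR_REVOKED)
            recover(comm);              /* every process eventually joins here */
    }
}
```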
Citations: 11
Specification Guideline Violations by MPI_Dims_create
Pub Date : 2015-09-21 DOI: 10.1145/2802658.2802677
J. Träff, F. Lübbe
In benchmarking a library providing alternative functionality for structured, so-called isomorphic, sparse collective communication [4], we made use of the MPI_Dims_create functionality of MPI [3] to suggest a balanced factorization of a given number p (of MPI processes) into d factors that can be used as the dimension sizes in a d-dimensional Cartesian communicator. Much to our surprise, we observed that a) different MPI libraries can differ quite significantly in the factorizations they suggest, and b) the produced factorizations can sometimes be far from balanced; indeed, for some composite numbers p, some MPI libraries return trivial factorizations (with p itself as a factor). This renders the functionality, as implemented, useless. In this poster abstract, we elaborate on these findings.
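A quick way to reproduce this kind of check (our own sketch, not the authors' benchmark) is to ask MPI_Dims_create for a d-dimensional factorization of p and compare the largest and smallest factors; a ratio far from 1, or a factorization containing p itself for composite p, indicates the imbalance described above.

```c
/* Sketch (not the authors' benchmark): inspect the factorization that
 * MPI_Dims_create suggests for p processes in d = 3 dimensions. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int p, rank, d = 3;
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int dims[3] = {0, 0, 0};        /* zeros let the library choose all factors */
    MPI_Dims_create(p, d, dims);

    if (rank == 0) {
        int max = dims[0], min = dims[0];
        for (int i = 1; i < d; i++) {
            if (dims[i] > max) max = dims[i];
            if (dims[i] < min) min = dims[i];
        }
        printf("p=%d -> %d x %d x %d (max/min ratio %.2f)\n",
               p, dims[0], dims[1], dims[2], (double)max / min);
    }

    MPI_Finalize();
    return 0;
}
```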
Citations: 8
Isomorphic, Sparse MPI-like Collective Communication Operations for Parallel Stencil Computations
Pub Date : 2015-09-21 DOI: 10.1145/2802658.2802663
J. Träff, F. Lübbe, Antoine Rougier, S. Hunold
We propose a specification and discuss implementations of collective operations for parallel stencil-like computations that are not well supported by the current MPI 3.1 neighborhood collectives. In our isomorphic, sparse collectives, all processes partaking in the communication operation use similar neighborhoods of processes with which to exchange data. Our interface assumes the p processes to be arranged in a d-dimensional torus (mesh) over which neighborhoods are specified per process by identical lists of relative coordinates. This significantly extends the functionality of Cartesian communicators, and is a much lighter mechanism than distributed graph topologies. It allows for fast, local computation of communication schedules, and can be used in more dynamic contexts than current MPI functionality. We sketch three algorithms for neighborhoods with s source and target neighbors, namely a) a direct algorithm taking s communication rounds, b) a message-combining algorithm that communicates only along torus coordinates, and c) a message-combining algorithm using between [log s] and [log p] communication rounds. Our concrete interface has been implemented using the direct algorithm a). We benchmark our implementations and compare them to the MPI neighborhood collectives. We demonstrate significant advantages in set-up times, and comparable communication times. Finally, we use our isomorphic, sparse collectives to implement a stencil computation with a deep halo, and discuss the derived datatypes required for this application.
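The proposed interface is not part of MPI, but the baseline it is compared against can be written with the MPI 3.1 neighborhood collectives. The sketch below (our illustration, with an example relative-coordinate list of our choosing) builds a 2D periodic torus, translates identical relative neighbor offsets into ranks on every process, and exchanges data with MPI_Neighbor_alltoall over a distributed graph topology.

```c
/* Illustration only (the paper's isomorphic collectives are not standard MPI):
 * a neighborhood given by identical relative coordinates on every process,
 * expressed with MPI 3.1 distributed-graph neighborhood collectives. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int p;
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    /* 2D periodic torus. */
    int dims[2] = {0, 0}, periods[2] = {1, 1}, coords[2], rank;
    MPI_Dims_create(p, 2, dims);
    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cart);
    MPI_Comm_rank(cart, &rank);
    MPI_Cart_coords(cart, rank, 2, coords);

    /* Identical relative neighbor list on every process (a 4-point stencil). */
    const int rel[4][2] = {{-1, 0}, {1, 0}, {0, -1}, {0, 1}};
    int nbr[4];
    for (int i = 0; i < 4; i++) {
        int c[2] = {coords[0] + rel[i][0], coords[1] + rel[i][1]};
        MPI_Cart_rank(cart, c, &nbr[i]);   /* periodic wrap-around */
    }

    /* Symmetric neighborhood: the same ranks as sources and destinations. */
    MPI_Comm nhood;
    MPI_Dist_graph_create_adjacent(cart, 4, nbr, MPI_UNWEIGHTED,
                                   4, nbr, MPI_UNWEIGHTED,
                                   MPI_INFO_NULL, 0, &nhood);

    double sendbuf[4], recvbuf[4];
    for (int i = 0; i < 4; i++) sendbuf[i] = rank;
    MPI_Neighbor_alltoall(sendbuf, 1, MPI_DOUBLE, recvbuf, 1, MPI_DOUBLE, nhood);

    MPI_Comm_free(&nhood);
    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}
```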
Citations: 14
Efficient, Optimal MPI Datatype Reconstruction for Vector and Index Types
Pub Date : 2015-09-21 DOI: 10.1145/2802658.2802671
Martin Kalany, J. Träff
Type reconstruction is the process of finding a representation of a data layout as an MPI derived datatype that is efficient in terms of space and processing time. Practically efficient type reconstruction and normalization are important for high-quality MPI implementations that strive to provide good performance for communication operations involving noncontiguous data. Although it has recently been shown that the general problem of computing optimal tree representations of derived datatypes, allowing any of the MPI derived-datatype constructors, can be solved in polynomial time, the algorithm for this may unfortunately be impractical for datatypes with large counts. By restricting the allowed constructors to vector and index-block type constructors, but excluding the most general MPI_Type_create_struct constructor, the problem can be solved much more efficiently. More precisely, we give a new O(n log n/log log n) time algorithm for finding cost-optimal representations of MPI type maps of length n using only vector and index-block constructors, for a simple but flexible, additive cost model. This improves significantly over a previous O(n√n) time algorithm for the same problem, and the algorithm is simple enough to be considered for practical MPI libraries.
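To make the object of the reconstruction concrete (our example, not from the paper), the same strided type map of length n can be described either by a flat index-block list whose size is proportional to n, or by a single vector constructor; finding the compact form automatically is exactly the normalization problem discussed above.

```c
/* Example (not from the paper): two representations of the same type map,
 * namely every other element of an array of 2*N doubles. */
#include <mpi.h>

#define N 1024

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Flat representation: one index-block entry per element, O(N) space. */
    int displs[N];
    for (int i = 0; i < N; i++)
        displs[i] = 2 * i;
    MPI_Datatype flat;
    MPI_Type_create_indexed_block(N, 1, displs, MPI_DOUBLE, &flat);
    MPI_Type_commit(&flat);

    /* Normalized representation: a single vector constructor, O(1) space,
     * describing exactly the same layout. */
    MPI_Datatype vec;
    MPI_Type_vector(N, 1, 2, MPI_DOUBLE, &vec);
    MPI_Type_commit(&vec);

    /* Either datatype can now be used interchangeably in sends and receives. */

    MPI_Type_free(&flat);
    MPI_Type_free(&vec);
    MPI_Finalize();
    return 0;
}
```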
Citations: 8
MPI Advisor: a Minimal Overhead Tool for MPI Library Performance Tuning
Pub Date : 2015-09-21 DOI: 10.1145/2802658.2802667
E. Gallardo, Jérôme Vienne, L. Fialho, P. Teller, J. Browne
A majority of parallel applications executed on HPC clusters use MPI for communication between processes. Most users treat MPI as a black box, executing their programs using the cluster's default settings. While the default settings perform adequately for many cases, it is well known that optimizing the MPI environment can significantly improve application performance. Although the existing optimization tools are effective when used by performance experts, they require deep knowledge of MPI library behavior and the underlying hardware architecture in which the application will be executed. Therefore, an easy-to-use tool that provides recommendations for configuring the MPI environment to optimize application performance is highly desirable. This paper addresses this need by presenting an easy-to-use methodology and tool, named MPI Advisor, that requires just a single execution of the input application to characterize its predominant communication behavior and determine the MPI configuration that may enhance its performance on the target combination of MPI library and hardware architecture. Currently, MPI Advisor provides recommendations that address the four most commonly occurring MPI-related performance bottlenecks, which are related to the choice of: 1) point-to-point protocol (eager vs. rendezvous), 2) collective communication algorithm, 3) MPI tasks-to-cores mapping, and 4) Infiniband transport protocol. The performance gains obtained by implementing the recommended optimizations in the case studies presented in this paper range from a few percent to more than 40%. Specifically, using this tool, we were able to improve the performance of HPCG with MVAPICH2 on four nodes of the Stampede cluster from 6.9 GFLOP/s to 10.1 GFLOP/s. Since the tool provides application-specific recommendations, it also informs the user about correct usage of MPI.
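The tool itself is not reproduced here, but the MPI_T tools interface standardized in MPI-3 gives a feel for the configuration knobs involved. The hedged sketch below simply enumerates the control variables an advisor could inspect; the variable names (eager/rendezvous thresholds, collective algorithm selectors, and so on) are implementation-specific and not assumed here.

```c
/* Sketch: list the MPI_T control variables exposed by the MPI library.
 * The variable names (eager thresholds, collective algorithm switches, ...)
 * are implementation-specific; this only shows how they can be enumerated. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, num_cvars, rank;

    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_T_cvar_get_num(&num_cvars);
    if (rank == 0) {
        for (int i = 0; i < num_cvars; i++) {
            char name[256], desc[256];
            int name_len = sizeof(name), desc_len = sizeof(desc);
            int verbosity, bind, scope;
            MPI_Datatype dtype;
            MPI_T_enum enumtype;
            MPI_T_cvar_get_info(i, name, &name_len, &verbosity, &dtype,
                                &enumtype, desc, &desc_len, &bind, &scope);
            printf("cvar %3d: %s\n", i, name);
        }
    }

    MPI_Finalize();
    MPI_T_finalize();
    return 0;
}
```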
Citations: 20
DAME: A Runtime-Compiled Engine for Derived Datatypes
Pub Date : 2015-09-21 DOI: 10.1145/2802658.2802659
Tarun Prabhu, W. Gropp
In order to achieve high performance on modern and future machines, applications need to make effective use of the complex, hierarchical memory system. Writing performance-portable code continues to be challenging since each architecture has unique memory access characteristics. In addition, some optimization decisions can only reasonably be made at runtime. This suggests that a two-pronged approach to address the challenge is required. First, provide the programmer with a means to express memory operations declaratively which will allow a runtime system to transparently access the memory in the best way and second, exploit runtime information. MPI's derived datatypes accomplish the former although their performance in current MPI implementations shows scope for improvement. JIT-compilation can be used for the latter. In this work, we present DAME --- a language and interpreter that is used as the backend for MPI's derived datatypes. We also present DAME-L and DAME-X, two JIT-enabled implementations of DAME. All three implementations have been integrated into MPICH. We evaluate the performance of our implementations using DDTBench and two mini-applications written with MPI derived datatypes and obtain communication speedups of up to 20x and mini-application speedup of 3x.
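As a reminder of what such a datatype engine has to interpret (a generic example, not DAME's internals), the snippet below describes one column of a row-major matrix with a derived datatype and sends it directly, leaving the packing and unpacking of the noncontiguous data to the MPI library.

```c
/* Generic derived-datatype example (not DAME internals): send one column of
 * a row-major ROWS x COLS matrix without manual packing. Run with >= 2 ranks. */
#include <mpi.h>

#define ROWS 128
#define COLS 64

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double a[ROWS][COLS];

    /* One column = ROWS blocks of 1 double, strided by a full row. */
    MPI_Datatype column;
    MPI_Type_vector(ROWS, 1, COLS, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    if (size >= 2) {
        if (rank == 0) {
            for (int i = 0; i < ROWS; i++)
                for (int j = 0; j < COLS; j++)
                    a[i][j] = i * COLS + j;
            /* The datatype engine gathers the strided elements on the fly. */
            MPI_Send(&a[0][3], 1, column, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&a[0][3], 1, column, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }
    }

    MPI_Type_free(&column);
    MPI_Finalize();
    return 0;
}
```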
Citations: 12
Performance Evaluation of OpenFOAM* with MPI-3 RMA Routines on Intel® Xeon® Processors and Intel® Xeon Phi™ Coprocessors
Pub Date : 2015-09-21 DOI: 10.1145/2802658.2802676
Nishant Agrawal, Paul Edwards, Ambuj Pandey, Michael Klemm, Ravi Ojha, R. A. Razak
OpenFOAM is a software package for solving partial differential equations and is very popular for computational fluid dynamics in the automotive segment. In this work, we describe our evaluation of the performance of OpenFOAM with MPI-3 Remote Memory Access (RMA) one-sided communication on the Intel® Xeon Phi™ coprocessor. Currently, OpenFOAM computes on a mesh that is decomposed among different MPI ranks, and it requires a large amount of communication between neighboring ranks. MPI-3 offers RMA through a new API that decouples communication and synchronization. The aim is to achieve better performance with MPI-3 RMA routines as compared to the current two-sided asynchronous communication routines in OpenFOAM. We also showcase the challenges overcome in order to facilitate the different MPI-3 RMA routines in OpenFOAM. This discussion aims at analyzing the potential of MPI-3 RMA in OpenFOAM and benchmarking the performance on both the processor and the coprocessor. Our work also demonstrates that MPI-3 RMA in OpenFOAM can run in symmetric mode consisting of the Intel® Xeon® E5-2697v3 processor and the Intel® Xeon Phi™ 7120P coprocessor.
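For context, here is a minimal sketch of the MPI-3 RMA style evaluated in this work (not OpenFOAM code): a window is allocated, halo memory is exposed, and a send/receive pair is replaced by an MPI_Put between fences, illustrating the decoupling of communication and synchronization mentioned above.

```c
/* Minimal MPI-3 RMA sketch (not OpenFOAM code): each rank writes its halo
 * contribution directly into its right neighbor's window. */
#include <mpi.h>

#define HALO 16

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Window memory that neighboring ranks may write into. */
    double *halo;
    MPI_Win win;
    MPI_Win_allocate(HALO * sizeof(double), sizeof(double),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &halo, &win);

    double local[HALO];
    for (int i = 0; i < HALO; i++)
        local[i] = rank;

    int right = (rank + 1) % size;

    /* Fence-based (active target) epoch: put, then synchronize. */
    MPI_Win_fence(0, win);
    MPI_Put(local, HALO, MPI_DOUBLE, right, 0, HALO, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);

    /* halo[] now holds the values written by the left neighbor. */

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```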
Citations: 1
On the Impact of Synchronizing Clocks and Processes on Benchmarking MPI Collectives
Pub Date : 2015-09-21 DOI: 10.1145/2802658.2802662
S. Hunold, Alexandra Carpen-Amarie
We consider the problem of accurately measuring the time to complete an MPI collective operation, as the result strongly depends on how the time is measured. Our goal is to develop an experimental method that allows for reproducible measurements of MPI collectives. When executing large parallel codes, MPI processes are often skewed in time when entering a collective operation. However, to obtain reproducible measurements, it is a common approach to synchronize all processes before they call the MPI collective operation. We therefore take a closer look at two commonly used process synchronization schemes: (1) relying on MPI_Barrier or (2) applying a window-based scheme using a common global time. We analyze both schemes experimentally and show the strengths and weaknesses of each approach. As window-based schemes require the notion of global time, we thoroughly evaluate different clock synchronization algorithms in various experiments. We also propose a novel clock synchronization algorithm that combines two advantages of known algorithms, which are (1) taking the inherent clock drift into account and (2) using a tree-based synchronization scheme to reduce the synchronization duration.
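To make the two schemes concrete (our sketch, not the authors' benchmark code), the fragment below times a single MPI_Bcast either after an MPI_Barrier or in a window-based fashion, where all processes spin until an agreed-on global start time; the reported value is the maximum time over all processes. The clock offsets used by the window scheme are assumed to have been estimated beforehand and are set to zero here for brevity.

```c
/* Sketch (not the authors' benchmark): barrier-based vs. window-based timing
 * of one MPI_Bcast. The window scheme assumes clock offsets were estimated
 * elsewhere (offset == 0 here for brevity). */
#include <mpi.h>
#include <stdio.h>

static double time_bcast_barrier(int *buf, int count, MPI_Comm comm)
{
    MPI_Barrier(comm);                        /* synchronize the processes */
    double t = MPI_Wtime();
    MPI_Bcast(buf, count, MPI_INT, 0, comm);
    return MPI_Wtime() - t;
}

static double time_bcast_window(int *buf, int count, MPI_Comm comm,
                                double offset /* local clock - global clock */)
{
    int rank;
    double start = 0.0;
    MPI_Comm_rank(comm, &rank);
    if (rank == 0)
        start = MPI_Wtime() - offset + 0.001; /* open the window 1 ms from now */
    MPI_Bcast(&start, 1, MPI_DOUBLE, 0, comm);

    while (MPI_Wtime() - offset < start)
        ;                                     /* busy-wait until the window opens */
    double t = MPI_Wtime();
    MPI_Bcast(buf, count, MPI_INT, 0, comm);
    return MPI_Wtime() - t;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, buf[1024] = {0};
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local, global;

    local = time_bcast_barrier(buf, 1024, MPI_COMM_WORLD);
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("barrier scheme: %.3f us\n", global * 1e6);

    local = time_bcast_window(buf, 1024, MPI_COMM_WORLD, 0.0);
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("window scheme:  %.3f us\n", global * 1e6);

    MPI_Finalize();
    return 0;
}
```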
Citations: 12