Proceedings of the 22nd European MPI Users' Group Meeting (DOI: 10.1145/2802658)

Sliding Substitution of Failed Nodes
A. Hori, Kazumi Yoshinaga, T. Hérault, Aurélien Bouteiller, G. Bosilca, Y. Ishikawa
This paper considers the questions of how spare nodes should be allocated, how they should be substituted for faulty nodes, and how much communication performance is affected by such a substitution. The third question stems from the modification of the rank mapping caused by node substitutions, which can incur additional message collisions. In a stencil computation, rank mapping is done in a straightforward way on a Cartesian network without incurring any message collisions. However, once a substitution has occurred, this node-rank mapping may be destroyed. Therefore, these questions must be answered in a way that minimizes the degradation of communication performance. In this paper, several spare-node allocation and node-substitution methods are proposed, analyzed, and compared in terms of communication performance following the substitution. It is shown that when a failure occurs, peer-to-peer (P2P) communication performance on the K computer can be slowed by a factor of three and collective performance can be cut in half. On BG/Q, P2P performance can be slowed by a factor of five and collective performance by a factor of ten. These penalties can, however, be reduced by using an appropriate substitution method.
{"title":"Sliding Substitution of Failed Nodes","authors":"A. Hori, Kazumi Yoshinaga, T. Hérault, Aurélien Bouteiller, G. Bosilca, Y. Ishikawa","doi":"10.1145/2802658.2802670","DOIUrl":"https://doi.org/10.1145/2802658.2802670","url":null,"abstract":"This paper considers the questions of how spare nodes should be allocated, how to substitute them for faulty nodes, and how much the communication performance is affected by such a substitution. The third question stems from the modification of the rank mapping by node substitutions, which can incur additional message collisions. In a stencil computation, rank mapping is done in a straightforward way on a Cartesian network without incurring any message collisions. However, once a substitution has occurred, the node- rank mapping may be destroyed. Therefore, these questions must be answered in a way that minimizes the degradation of communication performance. In this paper, several spare-node allocation and nodesubstitution methods will be proposed, analyzed, and compared in terms of communication performance following the substitution. It will be shown that when a failure occurs, the peer-to-peer (P2P) communication performance on the K computer can be slowed by a factor of three and collective performance can be cut in half. On BG/Q, P2P performance can be slowed by a factor of five and collective performance can be slowed by a factor of ten. However, those numbers can be reduced by using an appropriate substitution method.","PeriodicalId":365272,"journal":{"name":"Proceedings of the 22nd European MPI Users' Group Meeting","volume":"129 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116523598","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Memory Management System Optimized for BDMPI's Memory and Execution Model
J. Iverson, G. Karypis
There is a growing need to perform large computations on small systems, as access to large systems is not widely available and cannot keep up with the scaling of data. BDMPI was recently introduced as a way of achieving this for applications written in MPI. BDMPI allows the efficient execution of standard MPI programs on systems whose aggregate amount of memory is smaller than that required by the computations, and it significantly outperforms other approaches. In this paper we present a virtual memory subsystem which we implemented as part of the BDMPI runtime. Our new virtual memory subsystem, which we call SBMA, bypasses the operating system virtual memory manager to take advantage of BDMPI's node-level cooperative multi-tasking. Benchmarking using a synthetic application shows that, for the use cases relevant to BDMPI, the overhead incurred by the BDMPI-SBMA system is amortized such that it performs as fast as explicit data movement by the application developer. Furthermore, we tested SBMA with three different classes of applications, and our results show that with no modification to the original MPI program, speedups of 2×–12× over a standard BDMPI implementation can be achieved for the included applications.
{"title":"A Memory Management System Optimized for BDMPI's Memory and Execution Model","authors":"J. Iverson, G. Karypis","doi":"10.1145/2802658.2802666","DOIUrl":"https://doi.org/10.1145/2802658.2802666","url":null,"abstract":"There is a growing need to perform large computations on small systems, as access to large systems is not widely available and cannot keep up with the scaling of data. BDMPI was recently introduced as a way of achieving this for applications written in MPI. BDMPI allows the efficient execution of standard MPI programs on systems whose aggregate amount of memory is smaller than that required by the computations and significantly outperforms other approaches. In this paper we present a virtual memory subsystem which we implemented as part of the BDMPI runtime. Our new virtual memory subsystem, which we call SBMA, bypasses the operating system virtual memory manager to take advantage of BDMPI's node-level cooperative multi-taking. Benchmarking using a synthetic application shows that for the use cases relevant to BDMPI, the overhead incurred by the BDMPI-SBMA system is amortized such that it performs as fast as explicit data movement by the application developer. Furthermore, we tested SBMA with three different classes of applications and our results show that with no modification to the original MPI program, speedups from 2×--12× over a standard BDMPI implementation can be achieved for the included applications.","PeriodicalId":365272,"journal":{"name":"Proceedings of the 22nd European MPI Users' Group Meeting","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132353643","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Parallel Algorithm for Minimizing the Fleet Size in the Pickup and Delivery Problem with Time Windows
Miroslaw Blocho, J. Nalepa
In this paper, we propose a parallel guided ejection search algorithm to minimize the fleet size in the NP-hard pickup and delivery problem with time windows. The parallel processes co-operate periodically to enhance the quality of results and to accelerate the convergence of computations. The experimental study shows that the parallel algorithm obtains very high-quality results. Finally, we report 13 new world's best solutions (22% of all considered benchmark tests).
{"title":"A Parallel Algorithm for Minimizing the Fleet Size in the Pickup and Delivery Problem with Time Windows","authors":"Miroslaw Blocho, J. Nalepa","doi":"10.1145/2802658.2802673","DOIUrl":"https://doi.org/10.1145/2802658.2802673","url":null,"abstract":"In this paper, we propose a parallel guided ejection search algorithm to minimize the fleet size in the NP-hard pickup and delivery problem with time windows. The parallel processes co-operate periodically to enhance the quality of results and to accelerate the convergence of computations. The experimental study shows that the parallel algorithm retrieves very high-quality results. Finally, we report 13 (22% of all considered benchmark tests) new world's best solutions.","PeriodicalId":365272,"journal":{"name":"Proceedings of the 22nd European MPI Users' Group Meeting","volume":"63 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124168238","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Scalable and Fault Tolerant Failure Detection and Consensus
Amogh Katti, G. D. Fatta, T. Naughton, C. Engelmann
Future extreme-scale high-performance computing systems will be required to work under frequent component failures. The MPI Forum's User Level Failure Mitigation proposal has introduced an operation, MPI_Comm_shrink, to synchronize the surviving processes on the list of failed processes, so that applications can continue to execute even in the presence of failures by adopting algorithm-based fault tolerance techniques. This MPI_Comm_shrink operation requires a fault-tolerant failure detection and consensus algorithm. This paper presents and compares two novel failure detection and consensus algorithms. The proposed algorithms are based on Gossip protocols and are inherently fault-tolerant and scalable. They were implemented and tested using the Extreme-scale Simulator. The results show that for both algorithms the number of Gossip cycles needed to achieve global consensus scales logarithmically with system size. The second algorithm also shows better scalability in terms of memory and network bandwidth usage and perfect synchronization in achieving global consensus.
{"title":"Scalable and Fault Tolerant Failure Detection and Consensus","authors":"Amogh Katti, G. D. Fatta, T. Naughton, C. Engelmann","doi":"10.1145/2802658.2802660","DOIUrl":"https://doi.org/10.1145/2802658.2802660","url":null,"abstract":"Future extreme-scale high-performance computing systems will be required to work under frequent component failures. The MPI Forum's User Level Failure Mitigation proposal has introduced an operation, MPI_Comm_shrink, to synchronize the alive processes on the list of failed processes, so that applications can continue to execute even in the presence of failures by adopting algorithm-based fault tolerance techniques. This MPI_Comm_shrink operation requires a fault tolerant failure detection and consensus algorithm. This paper presents and compares two novel failure detection and consensus algorithms. The proposed algorithms are based on Gossip protocols and are inherently fault-tolerant and scalable. The proposed algorithms were implemented and tested using the Extreme-scale Simulator. The results show that in both algorithms the number of Gossip cycles to achieve global consensus scales logarithmically with system size. The second algorithm also shows better scalability in terms of memory and network bandwidth usage and a perfect synchronization in achieving global consensus.","PeriodicalId":365272,"journal":{"name":"Proceedings of the 22nd European MPI Users' Group Meeting","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125318892","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GPU-Aware Design, Implementation, and Evaluation of Non-blocking Collective Benchmarks
A. Awan, Khaled Hamidouche, Akshay Venkatesh, Jonathan L. Perkins, H. Subramoni, D. Panda
As we move towards efficient exascale systems, heterogeneous accelerators like NVIDIA GPUs are becoming a significant compute component of modern HPC clusters. It has become important to utilize every single cycle of every compute device available in the system. From NICs to GPUs to co-processors, heterogeneous compute resources are the way forward. Another important trend, especially with the introduction of non-blocking collective communication in the latest MPI standard, is overlapping communication with computation; this has become an important design goal for messaging libraries like MVAPICH2 and OpenMPI. In this paper, we present a benchmark that allows users of different MPI libraries to evaluate the performance of GPU-aware non-blocking collectives. The main performance metrics are overlap and latency. We provide insights on designing a GPU-aware benchmark and discuss the challenges associated with identifying and implementing performance parameters such as overlap, latency, the effect of MPI_Test() calls used to progress communication, the effect of independent GPU communication while the overlapped computation proceeds, and the effect of the complexity, target, and scale of this overlapped computation. To illustrate the efficacy of the proposed benchmark, we provide a comparative performance evaluation of GPU-aware non-blocking collectives in MVAPICH2 and OpenMPI.
{"title":"GPU-Aware Design, Implementation, and Evaluation of Non-blocking Collective Benchmarks","authors":"A. Awan, Khaled Hamidouche, Akshay Venkatesh, Jonathan L. Perkins, H. Subramoni, D. Panda","doi":"10.1145/2802658.2802672","DOIUrl":"https://doi.org/10.1145/2802658.2802672","url":null,"abstract":"As we move towards efficient exascale systems, heterogeneous accelerators like NVIDIA GPUs are becoming a significant compute component of modern HPC clusters. It has become important to utilize every single cycle of every compute device available in the system. From NICs to GPUs to Co-processors, heterogeneous compute resources are the way to move forward. Another important trend, especially with the introduction of non-blocking collective communication in the latest MPI standard, is overlapping communication with computation. It has become an important design goal for messaging libraries like MVAPICH2 and OpenMPI. In this paper, we present an important benchmark that allows the users of different MPI libraries to evaluate performance of GPU-Aware Non-Blocking Collectives. The main performance metrics are overlap and latency. We provide insights on designing a GPU-Aware benchmark and discuss the challenges associated with identifying and implementing performance parameters like overlap, latency, effect of MPI_Test() calls to progress communication, effect of independent GPU communication while the overlapped computation proceeds under the communication, and the effect of complexity, target, and scale of this overlapped computation. To illustrate the efficacy of the proposed benchmark, we provide a comparative performance evaluation of GPU-Aware Non-Blocking Collectives in MVAPICH2 and OpenMPI.","PeriodicalId":365272,"journal":{"name":"Proceedings of the 22nd European MPI Users' Group Meeting","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126899411","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Detecting Silent Data Corruption for Extreme-Scale MPI Applications
L. Bautista-Gomez, F. Cappello
Next-generation supercomputers are expected to have more components and, at the same time, to consume several times less energy per operation. These trends are pushing supercomputer construction to the limits of miniaturization and energy-saving strategies. Consequently, the number of soft errors is expected to increase dramatically in the coming years. While mechanisms are in place to correct or at least detect some soft errors, a significant percentage of those errors pass unnoticed by the hardware. Such silent errors are extremely damaging because they can make applications silently produce wrong results. In this work we propose a technique that leverages certain properties of high-performance computing applications in order to detect silent errors at the application level. Our technique detects corruption based solely on the behavior of the application datasets and is application-agnostic. We propose multiple corruption detectors and couple them to work together in a fashion transparent to the user. We demonstrate that this strategy can detect over 80% of corruptions while incurring less than 1% overhead. We show that the false positive rate is less than 1% and that, when multi-bit corruptions are taken into account, the detection recall increases to over 95%.
{"title":"Detecting Silent Data Corruption for Extreme-Scale MPI Applications","authors":"L. Bautista-Gomez, F. Cappello","doi":"10.1145/2802658.2802665","DOIUrl":"https://doi.org/10.1145/2802658.2802665","url":null,"abstract":"Next-generation supercomputers are expected to have more components and, at the same time, consume several times less energy per operation. These trends are pushing supercomputer construction to the limits of miniaturization and energy-saving strategies. Consequently, the number of soft errors is expected to increase dramatically in the coming years. While mechanisms are in place to correct or at least detect some soft errors, a significant percentage of those errors pass unnoticed by the hardware. Such silent errors are extremely damaging because they can make applications silently produce wrong results. In this work we propose a technique that leverages certain properties of high-performance computing applications in order to detect silent errors at the application level. Our technique detects corruption based solely on the behavior of the application datasets and is application-agnostic. We propose multiple corruption detectors, and we couple them to work together in a fashion transparent to the user. We demonstrate that this strategy can detect over 80% of corruptions, while incurring less than 1% of overhead. We show that the false positive rate is less than 1% and that when multi-bit corruptions are taken into account, the detection recall increases to over 95%.","PeriodicalId":365272,"journal":{"name":"Proceedings of the 22nd European MPI Users' Group Meeting","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114497635","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MPI-focused Tracing with OTFX: An MPI-aware In-memory Event Tracing Extension to the Open Trace Format 2
M. Wagner, J. Doleschal, A. Knüpfer
Performance analysis tools are more indispensable than ever for developing applications that utilize the enormous computing resources of high performance computing (HPC) systems. In event-based performance analysis, the amount of collected data is one of the most pressing challenges. The resulting measurement bias caused by uncoordinated intermediate memory buffer flushes in the monitoring tool can render a meaningful analysis of the parallel behavior impossible. In this paper we address the impact of intermediate memory buffer flushes and present a method to avoid file interaction in the monitoring tool entirely. We propose an MPI-focused tracing approach that records the complete MPI communication behavior and adapts the remaining application events to an amount that fits into a single memory buffer. We demonstrate the capabilities of our method with an MPI-focused prototype implementation of OTFX, based on the Open Trace Format 2, a state-of-the-art open-source event tracing library used by the performance analysis tools Vampir, Scalasca, and Tau. In a comparison with OTF2 based on seven applications from different scientific domains, our prototype introduces on average 5.1% less overhead and reduces the trace size by up to three orders of magnitude.
{"title":"MPI-focused Tracing with OTFX: An MPI-aware In-memory Event Tracing Extension to the Open Trace Format 2","authors":"M. Wagner, J. Doleschal, A. Knüpfer","doi":"10.1145/2802658.2802664","DOIUrl":"https://doi.org/10.1145/2802658.2802664","url":null,"abstract":"Performance analysis tools are more than ever inevitable to develop applications that utilize the enormous computing resources of high performance computing (HPC) systems. In event-based performance analysis the amount of collected data is one of the most urgent challenges. The resulting measurement bias caused by uncoordinated intermediate memory buffer flushes in the monitoring tool can render a meaningful analysis of the parallel behavior impossible. In this paper we address the impact of intermediate memory buffer flushes and present a method to avoid file interaction in the monitoring tool entirely. We propose an MPI-focused tracing approach that provides the complete MPI communication behavior and adapts the remaining application events to an amount that fits into a single memory buffer. We demonstrate the capabilities of our method with an MPI-focused prototype implementation of OTFX, based on the Open Trace Format 2, a state-of-the-art Open Source event tracing library used by the performance analysis tools Vampir, Scalasca, and Tau. In a comparison to OTF2 based on seven applications from different scientific domains, our prototype introduces in average 5.1% less overhead and reduces the trace size up to three orders of magnitude.","PeriodicalId":365272,"journal":{"name":"Proceedings of the 22nd European MPI Users' Group Meeting","volume":"140 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122276450","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Toward Operating System Support for Scalable Multithreaded Message Passing
Balazs Gerofi, Masamichi Takagi, Y. Ishikawa
Modern CPU architectures provide a large number of processing cores, and application programmers are increasingly looking at hybrid programming models, where multiple threads of a single process interact with the MPI library simultaneously. Moreover, recent high-speed interconnection networks are being designed with capabilities targeting communication explicitly from multiple processor cores. As a result, making the MPI library scalable, so that multithreaded applications can efficiently drive independent network communication, has become a major concern. In this work, we propose a novel operating-system-level concept called the thread private shared library (TPSL), which enables threads of a multithreaded application to see specific shared libraries in a private fashion. Contrary to address spaces in traditional operating systems, where threads of a single process refer to exactly the same set of virtual-to-physical mappings, our technique relies on separate per-thread page tables. Mapping the MPI library in a thread-private fashion results in per-thread MPI ranks, eliminating resource contention in the MPI library without the need to redesign it. To demonstrate the benefits of our mechanism, we provide a preliminary evaluation of various aspects of multithreaded MPI processing through micro-benchmarks on two widely used MPI implementations, MPICH and MVAPICH, with only minor modifications to the libraries.
{"title":"Toward Operating System Support for Scalable Multithreaded Message Passing","authors":"Balazs Gerofi, Masamichi Takagi, Y. Ishikawa","doi":"10.1145/2802658.2802661","DOIUrl":"https://doi.org/10.1145/2802658.2802661","url":null,"abstract":"Modern CPU architectures provide a large number of processing cores and application programmers are increasingly looking at hybrid programming models, where multiple threads of a single process interact with the MPI library simultaneously. Moreover, recent high-speed interconnection networks are being designed with capabilities targeting communication explicitly from multiple processor cores. As a result, scalability of the MPI library so that multithreaded applications can efficiently drive independent network communication has become a major concern. In this work, we propose a novel operating system level concept called the thread private shared library (TPSL), which enables threads of a multithreaded application to see specific shared libraries in a private fashion. Contrary to address spaces in traditional operating systems, where threads of a single process refer to the exact same set of virtual to physical mappings, our technique relies on per-thread separate page tables. Mapping the MPI library in a thread private fashion results in per-thread MPI ranks eliminating resource contention in the MPI library without the need for redesigning it. To demonstrate the benefits of our mechanism, we provide preliminary evaluation for various aspects of multithreaded MPI processing through micro-benchmarks on two widely used MPI implementations, MPICH and MVAPICH, with only minor modifications to the libraries.","PeriodicalId":365272,"journal":{"name":"Proceedings of the 22nd European MPI Users' Group Meeting","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126945905","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An MPI Halo-Cell Implementation for Zero-Copy Abstraction
Jean-Baptiste Besnard, A. Malony, S. Shende, Marc Pérache, Patrick Carribault, Julien Jaeger
In the race for Exascale, the advent of many-core processors will bring a shift in parallel computing architectures to systems of much higher concurrency, but with relatively less memory per thread. This shift raises concerns about how well current-generation HPC software will adapt to this new environment. In this paper, we study domain splitting over an increasing number of memory areas as an example problem where a negative performance impact on computation could arise. We identify the specific parameters that drive scalability for this problem, and then model the halo-cell ratio on common mesh topologies to study the memory and communication implications. Such analysis argues for the use of shared-memory parallelism, such as with OpenMP, to address the performance problems that could occur. In contrast, we propose an original solution based entirely on MPI programming semantics, while providing the performance advantages of hybrid parallel programming. Our solution transparently replaces halo-cell transfers with pointer exchanges when MPI tasks are running on the same node, effectively removing memory copies. The results we present demonstrate gains in terms of memory and computation time on Xeon Phi (compared to OpenMP-only and MPI-only versions) using a representative domain decomposition benchmark.
{"title":"An MPI Halo-Cell Implementation for Zero-Copy Abstraction","authors":"Jean-Baptiste Besnard, A. Malony, S. Shende, Marc Pérache, Patrick Carribault, Julien Jaeger","doi":"10.1145/2802658.2802669","DOIUrl":"https://doi.org/10.1145/2802658.2802669","url":null,"abstract":"In the race for Exascale, the advent of many-core processors will bring a shift in parallel computing architectures to systems of much higher concurrency, but with a relatively smaller memory per thread. This shift raises concerns for the adaptability of HPC software, for the current generation to the brave new world. In this paper, we study domain splitting on an increasing number of memory areas as an example problem where negative performance impact on computation could arise. We identify the specific parameters that drive scalability for this problem, and then model the halo-cell ratio on common mesh topologies to study the memory and communication implications. Such analysis argues for the use of shared-memory parallelism, such as with OpenMP, to address the performance problems that could occur. In contrast, we propose an original solution based entirely on MPI programming semantics, while providing the performance advantages of hybrid parallel programming. Our solution transparently replaces halo-cells transfers with pointer exchanges when MPI tasks are running on the same node, effectively removing memory copies. The results we present demonstrate gains in terms of memory and computation time on Xeon Phi (compared to OpenMP-only and MPI-only) using a representative domain decomposition benchmark.","PeriodicalId":365272,"journal":{"name":"Proceedings of the 22nd European MPI Users' Group Meeting","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133130107","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proceedings of the 22nd European MPI Users' Group Meeting","authors":"","doi":"10.1145/2802658","DOIUrl":"https://doi.org/10.1145/2802658","url":null,"abstract":"","PeriodicalId":365272,"journal":{"name":"Proceedings of the 22nd European MPI Users' Group Meeting","volume":"304 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122775196","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}