Proceedings of the 22nd European MPI Users' Group Meeting (DOI: 10.1145/2802658)

Sliding Substitution of Failed Nodes
A. Hori, Kazumi Yoshinaga, T. Hérault, Aurélien Bouteiller, G. Bosilca, Y. Ishikawa
This paper considers the questions of how spare nodes should be allocated, how they should be substituted for faulty nodes, and how much communication performance is affected by such a substitution. The third question stems from the modification of the rank mapping caused by node substitutions, which can incur additional message collisions. In a stencil computation, rank mapping is done in a straightforward way on a Cartesian network without incurring any message collisions. However, once a substitution has occurred, this node-rank mapping may be destroyed. Therefore, these questions must be answered in a way that minimizes the degradation of communication performance. In this paper, several spare-node allocation and node-substitution methods are proposed, analyzed, and compared in terms of communication performance following the substitution. It is shown that when a failure occurs, peer-to-peer (P2P) communication performance on the K computer can be slowed by a factor of three and collective performance can be cut in half. On BG/Q, P2P performance can be slowed by a factor of five and collective performance by a factor of ten. These penalties can, however, be reduced by using an appropriate substitution method.
{"title":"Sliding Substitution of Failed Nodes","authors":"A. Hori, Kazumi Yoshinaga, T. Hérault, Aurélien Bouteiller, G. Bosilca, Y. Ishikawa","doi":"10.1145/2802658.2802670","DOIUrl":"https://doi.org/10.1145/2802658.2802670","url":null,"abstract":"This paper considers the questions of how spare nodes should be allocated, how to substitute them for faulty nodes, and how much the communication performance is affected by such a substitution. The third question stems from the modification of the rank mapping by node substitutions, which can incur additional message collisions. In a stencil computation, rank mapping is done in a straightforward way on a Cartesian network without incurring any message collisions. However, once a substitution has occurred, the node- rank mapping may be destroyed. Therefore, these questions must be answered in a way that minimizes the degradation of communication performance. In this paper, several spare-node allocation and nodesubstitution methods will be proposed, analyzed, and compared in terms of communication performance following the substitution. It will be shown that when a failure occurs, the peer-to-peer (P2P) communication performance on the K computer can be slowed by a factor of three and collective performance can be cut in half. On BG/Q, P2P performance can be slowed by a factor of five and collective performance can be slowed by a factor of ten. However, those numbers can be reduced by using an appropriate substitution method.","PeriodicalId":365272,"journal":{"name":"Proceedings of the 22nd European MPI Users' Group Meeting","volume":"129 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116523598","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Memory Management System Optimized for BDMPI's Memory and Execution Model
J. Iverson, G. Karypis
There is a growing need to perform large computations on small systems, as access to large systems is not widely available and cannot keep up with the scaling of data. BDMPI was recently introduced as a way of achieving this for applications written in MPI. BDMPI allows the efficient execution of standard MPI programs on systems whose aggregate amount of memory is smaller than that required by the computations, and it significantly outperforms other approaches. In this paper we present a virtual memory subsystem which we implemented as part of the BDMPI runtime. Our new virtual memory subsystem, which we call SBMA, bypasses the operating system virtual memory manager to take advantage of BDMPI's node-level cooperative multi-tasking. Benchmarking using a synthetic application shows that, for the use cases relevant to BDMPI, the overhead incurred by the BDMPI-SBMA system is amortized such that it performs as fast as explicit data movement by the application developer. Furthermore, we tested SBMA with three different classes of applications, and our results show that with no modification to the original MPI program, speedups of 2×–12× over a standard BDMPI implementation can be achieved for the included applications.
{"title":"A Memory Management System Optimized for BDMPI's Memory and Execution Model","authors":"J. Iverson, G. Karypis","doi":"10.1145/2802658.2802666","DOIUrl":"https://doi.org/10.1145/2802658.2802666","url":null,"abstract":"There is a growing need to perform large computations on small systems, as access to large systems is not widely available and cannot keep up with the scaling of data. BDMPI was recently introduced as a way of achieving this for applications written in MPI. BDMPI allows the efficient execution of standard MPI programs on systems whose aggregate amount of memory is smaller than that required by the computations and significantly outperforms other approaches. In this paper we present a virtual memory subsystem which we implemented as part of the BDMPI runtime. Our new virtual memory subsystem, which we call SBMA, bypasses the operating system virtual memory manager to take advantage of BDMPI's node-level cooperative multi-taking. Benchmarking using a synthetic application shows that for the use cases relevant to BDMPI, the overhead incurred by the BDMPI-SBMA system is amortized such that it performs as fast as explicit data movement by the application developer. Furthermore, we tested SBMA with three different classes of applications and our results show that with no modification to the original MPI program, speedups from 2×--12× over a standard BDMPI implementation can be achieved for the included applications.","PeriodicalId":365272,"journal":{"name":"Proceedings of the 22nd European MPI Users' Group Meeting","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132353643","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Parallel Algorithm for Minimizing the Fleet Size in the Pickup and Delivery Problem with Time Windows
Miroslaw Blocho, J. Nalepa
In this paper, we propose a parallel guided ejection search algorithm to minimize the fleet size in the NP-hard pickup and delivery problem with time windows. The parallel processes co-operate periodically to enhance the quality of results and to accelerate the convergence of computations. The experimental study shows that the parallel algorithm obtains very high-quality results. Finally, we report 13 new world's best solutions (22% of all considered benchmark tests).
{"title":"A Parallel Algorithm for Minimizing the Fleet Size in the Pickup and Delivery Problem with Time Windows","authors":"Miroslaw Blocho, J. Nalepa","doi":"10.1145/2802658.2802673","DOIUrl":"https://doi.org/10.1145/2802658.2802673","url":null,"abstract":"In this paper, we propose a parallel guided ejection search algorithm to minimize the fleet size in the NP-hard pickup and delivery problem with time windows. The parallel processes co-operate periodically to enhance the quality of results and to accelerate the convergence of computations. The experimental study shows that the parallel algorithm retrieves very high-quality results. Finally, we report 13 (22% of all considered benchmark tests) new world's best solutions.","PeriodicalId":365272,"journal":{"name":"Proceedings of the 22nd European MPI Users' Group Meeting","volume":"63 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124168238","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Scalable and Fault Tolerant Failure Detection and Consensus
Amogh Katti, G. D. Fatta, T. Naughton, C. Engelmann
Future extreme-scale high-performance computing systems will be required to work under frequent component failures. The MPI Forum's User Level Failure Mitigation proposal has introduced an operation, MPI_Comm_shrink, to synchronize the surviving processes on the list of failed processes, so that applications can continue to execute even in the presence of failures by adopting algorithm-based fault tolerance techniques. This MPI_Comm_shrink operation requires a fault-tolerant failure detection and consensus algorithm. This paper presents and compares two novel failure detection and consensus algorithms. The proposed algorithms are based on Gossip protocols and are inherently fault-tolerant and scalable. They were implemented and tested using the Extreme-scale Simulator. The results show that for both algorithms the number of Gossip cycles needed to achieve global consensus scales logarithmically with system size. The second algorithm also shows better scalability in terms of memory and network bandwidth usage and perfect synchronization in achieving global consensus.
{"title":"Scalable and Fault Tolerant Failure Detection and Consensus","authors":"Amogh Katti, G. D. Fatta, T. Naughton, C. Engelmann","doi":"10.1145/2802658.2802660","DOIUrl":"https://doi.org/10.1145/2802658.2802660","url":null,"abstract":"Future extreme-scale high-performance computing systems will be required to work under frequent component failures. The MPI Forum's User Level Failure Mitigation proposal has introduced an operation, MPI_Comm_shrink, to synchronize the alive processes on the list of failed processes, so that applications can continue to execute even in the presence of failures by adopting algorithm-based fault tolerance techniques. This MPI_Comm_shrink operation requires a fault tolerant failure detection and consensus algorithm. This paper presents and compares two novel failure detection and consensus algorithms. The proposed algorithms are based on Gossip protocols and are inherently fault-tolerant and scalable. The proposed algorithms were implemented and tested using the Extreme-scale Simulator. The results show that in both algorithms the number of Gossip cycles to achieve global consensus scales logarithmically with system size. The second algorithm also shows better scalability in terms of memory and network bandwidth usage and a perfect synchronization in achieving global consensus.","PeriodicalId":365272,"journal":{"name":"Proceedings of the 22nd European MPI Users' Group Meeting","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125318892","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GPU-Aware Design, Implementation, and Evaluation of Non-blocking Collective Benchmarks
A. Awan, Khaled Hamidouche, Akshay Venkatesh, Jonathan L. Perkins, H. Subramoni, D. Panda
As we move towards efficient exascale systems, heterogeneous accelerators like NVIDIA GPUs are becoming a significant compute component of modern HPC clusters. It has become important to utilize every single cycle of every compute device available in the system. From NICs to GPUs to co-processors, heterogeneous compute resources are the way forward. Another important trend, especially with the introduction of non-blocking collective communication in the latest MPI standard, is overlapping communication with computation; this has become an important design goal for messaging libraries like MVAPICH2 and OpenMPI. In this paper, we present a benchmark that allows users of different MPI libraries to evaluate the performance of GPU-aware non-blocking collectives. The main performance metrics are overlap and latency. We provide insights on designing a GPU-aware benchmark and discuss the challenges associated with identifying and implementing performance parameters such as overlap, latency, the effect of MPI_Test() calls used to progress communication, the effect of independent GPU communication while the overlapped computation proceeds, and the effect of the complexity, target, and scale of this overlapped computation. To illustrate the efficacy of the proposed benchmark, we provide a comparative performance evaluation of GPU-aware non-blocking collectives in MVAPICH2 and OpenMPI.
{"title":"GPU-Aware Design, Implementation, and Evaluation of Non-blocking Collective Benchmarks","authors":"A. Awan, Khaled Hamidouche, Akshay Venkatesh, Jonathan L. Perkins, H. Subramoni, D. Panda","doi":"10.1145/2802658.2802672","DOIUrl":"https://doi.org/10.1145/2802658.2802672","url":null,"abstract":"As we move towards efficient exascale systems, heterogeneous accelerators like NVIDIA GPUs are becoming a significant compute component of modern HPC clusters. It has become important to utilize every single cycle of every compute device available in the system. From NICs to GPUs to Co-processors, heterogeneous compute resources are the way to move forward. Another important trend, especially with the introduction of non-blocking collective communication in the latest MPI standard, is overlapping communication with computation. It has become an important design goal for messaging libraries like MVAPICH2 and OpenMPI. In this paper, we present an important benchmark that allows the users of different MPI libraries to evaluate performance of GPU-Aware Non-Blocking Collectives. The main performance metrics are overlap and latency. We provide insights on designing a GPU-Aware benchmark and discuss the challenges associated with identifying and implementing performance parameters like overlap, latency, effect of MPI_Test() calls to progress communication, effect of independent GPU communication while the overlapped computation proceeds under the communication, and the effect of complexity, target, and scale of this overlapped computation. To illustrate the efficacy of the proposed benchmark, we provide a comparative performance evaluation of GPU-Aware Non-Blocking Collectives in MVAPICH2 and OpenMPI.","PeriodicalId":365272,"journal":{"name":"Proceedings of the 22nd European MPI Users' Group Meeting","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126899411","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Detecting Silent Data Corruption for Extreme-Scale MPI Applications
L. Bautista-Gomez, F. Cappello
Next-generation supercomputers are expected to have more components and, at the same time, to consume several times less energy per operation. These trends are pushing supercomputer construction to the limits of miniaturization and energy-saving strategies. Consequently, the number of soft errors is expected to increase dramatically in the coming years. While mechanisms are in place to correct or at least detect some soft errors, a significant percentage of those errors pass unnoticed by the hardware. Such silent errors are extremely damaging because they can make applications silently produce wrong results. In this work we propose a technique that leverages certain properties of high-performance computing applications in order to detect silent errors at the application level. Our technique detects corruption based solely on the behavior of the application datasets and is application-agnostic. We propose multiple corruption detectors and couple them to work together in a fashion transparent to the user. We demonstrate that this strategy can detect over 80% of corruptions while incurring less than 1% overhead. We show that the false positive rate is less than 1% and that, when multi-bit corruptions are taken into account, the detection recall increases to over 95%.
{"title":"Detecting Silent Data Corruption for Extreme-Scale MPI Applications","authors":"L. Bautista-Gomez, F. Cappello","doi":"10.1145/2802658.2802665","DOIUrl":"https://doi.org/10.1145/2802658.2802665","url":null,"abstract":"Next-generation supercomputers are expected to have more components and, at the same time, consume several times less energy per operation. These trends are pushing supercomputer construction to the limits of miniaturization and energy-saving strategies. Consequently, the number of soft errors is expected to increase dramatically in the coming years. While mechanisms are in place to correct or at least detect some soft errors, a significant percentage of those errors pass unnoticed by the hardware. Such silent errors are extremely damaging because they can make applications silently produce wrong results. In this work we propose a technique that leverages certain properties of high-performance computing applications in order to detect silent errors at the application level. Our technique detects corruption based solely on the behavior of the application datasets and is application-agnostic. We propose multiple corruption detectors, and we couple them to work together in a fashion transparent to the user. We demonstrate that this strategy can detect over 80% of corruptions, while incurring less than 1% of overhead. We show that the false positive rate is less than 1% and that when multi-bit corruptions are taken into account, the detection recall increases to over 95%.","PeriodicalId":365272,"journal":{"name":"Proceedings of the 22nd European MPI Users' Group Meeting","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114497635","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MPI-focused Tracing with OTFX: An MPI-aware In-memory Event Tracing Extension to the Open Trace Format 2
M. Wagner, J. Doleschal, A. Knüpfer
Performance analysis tools are more indispensable than ever for developing applications that utilize the enormous computing resources of high performance computing (HPC) systems. In event-based performance analysis, the amount of collected data is one of the most pressing challenges. The resulting measurement bias caused by uncoordinated intermediate memory buffer flushes in the monitoring tool can render a meaningful analysis of the parallel behavior impossible. In this paper we address the impact of intermediate memory buffer flushes and present a method to avoid file interaction in the monitoring tool entirely. We propose an MPI-focused tracing approach that records the complete MPI communication behavior and adapts the remaining application events to an amount that fits into a single memory buffer. We demonstrate the capabilities of our method with an MPI-focused prototype implementation of OTFX, based on the Open Trace Format 2, a state-of-the-art open-source event tracing library used by the performance analysis tools Vampir, Scalasca, and Tau. In a comparison with OTF2 based on seven applications from different scientific domains, our prototype introduces on average 5.1% less overhead and reduces the trace size by up to three orders of magnitude.
{"title":"MPI-focused Tracing with OTFX: An MPI-aware In-memory Event Tracing Extension to the Open Trace Format 2","authors":"M. Wagner, J. Doleschal, A. Knüpfer","doi":"10.1145/2802658.2802664","DOIUrl":"https://doi.org/10.1145/2802658.2802664","url":null,"abstract":"Performance analysis tools are more than ever inevitable to develop applications that utilize the enormous computing resources of high performance computing (HPC) systems. In event-based performance analysis the amount of collected data is one of the most urgent challenges. The resulting measurement bias caused by uncoordinated intermediate memory buffer flushes in the monitoring tool can render a meaningful analysis of the parallel behavior impossible. In this paper we address the impact of intermediate memory buffer flushes and present a method to avoid file interaction in the monitoring tool entirely. We propose an MPI-focused tracing approach that provides the complete MPI communication behavior and adapts the remaining application events to an amount that fits into a single memory buffer. We demonstrate the capabilities of our method with an MPI-focused prototype implementation of OTFX, based on the Open Trace Format 2, a state-of-the-art Open Source event tracing library used by the performance analysis tools Vampir, Scalasca, and Tau. In a comparison to OTF2 based on seven applications from different scientific domains, our prototype introduces in average 5.1% less overhead and reduces the trace size up to three orders of magnitude.","PeriodicalId":365272,"journal":{"name":"Proceedings of the 22nd European MPI Users' Group Meeting","volume":"140 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122276450","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Toward Operating System Support for Scalable Multithreaded Message Passing
Balazs Gerofi, Masamichi Takagi, Y. Ishikawa
Modern CPU architectures provide a large number of processing cores, and application programmers are increasingly looking at hybrid programming models, where multiple threads of a single process interact with the MPI library simultaneously. Moreover, recent high-speed interconnection networks are being designed with capabilities targeting communication explicitly from multiple processor cores. As a result, making the MPI library scalable, so that multithreaded applications can efficiently drive independent network communication, has become a major concern. In this work, we propose a novel operating-system-level concept called the thread private shared library (TPSL), which enables threads of a multithreaded application to see specific shared libraries in a private fashion. Contrary to address spaces in traditional operating systems, where threads of a single process refer to exactly the same set of virtual-to-physical mappings, our technique relies on separate per-thread page tables. Mapping the MPI library in a thread-private fashion results in per-thread MPI ranks, eliminating resource contention in the MPI library without the need to redesign it. To demonstrate the benefits of our mechanism, we provide a preliminary evaluation of various aspects of multithreaded MPI processing through micro-benchmarks on two widely used MPI implementations, MPICH and MVAPICH, with only minor modifications to the libraries.
{"title":"Toward Operating System Support for Scalable Multithreaded Message Passing","authors":"Balazs Gerofi, Masamichi Takagi, Y. Ishikawa","doi":"10.1145/2802658.2802661","DOIUrl":"https://doi.org/10.1145/2802658.2802661","url":null,"abstract":"Modern CPU architectures provide a large number of processing cores and application programmers are increasingly looking at hybrid programming models, where multiple threads of a single process interact with the MPI library simultaneously. Moreover, recent high-speed interconnection networks are being designed with capabilities targeting communication explicitly from multiple processor cores. As a result, scalability of the MPI library so that multithreaded applications can efficiently drive independent network communication has become a major concern. In this work, we propose a novel operating system level concept called the thread private shared library (TPSL), which enables threads of a multithreaded application to see specific shared libraries in a private fashion. Contrary to address spaces in traditional operating systems, where threads of a single process refer to the exact same set of virtual to physical mappings, our technique relies on per-thread separate page tables. Mapping the MPI library in a thread private fashion results in per-thread MPI ranks eliminating resource contention in the MPI library without the need for redesigning it. To demonstrate the benefits of our mechanism, we provide preliminary evaluation for various aspects of multithreaded MPI processing through micro-benchmarks on two widely used MPI implementations, MPICH and MVAPICH, with only minor modifications to the libraries.","PeriodicalId":365272,"journal":{"name":"Proceedings of the 22nd European MPI Users' Group Meeting","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126945905","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An MPI Halo-Cell Implementation for Zero-Copy Abstraction
Jean-Baptiste Besnard, A. Malony, S. Shende, Marc Pérache, Patrick Carribault, Julien Jaeger
In the race for Exascale, the advent of many-core processors will bring a shift in parallel computing architectures to systems of much higher concurrency, but with relatively less memory per thread. This shift raises concerns about how well current-generation HPC software will adapt to this new environment. In this paper, we study domain splitting over an increasing number of memory areas as an example problem where a negative performance impact on computation could arise. We identify the specific parameters that drive scalability for this problem, and then model the halo-cell ratio on common mesh topologies to study the memory and communication implications. Such analysis argues for the use of shared-memory parallelism, such as with OpenMP, to address the performance problems that could occur. In contrast, we propose an original solution based entirely on MPI programming semantics, while providing the performance advantages of hybrid parallel programming. Our solution transparently replaces halo-cell transfers with pointer exchanges when MPI tasks are running on the same node, effectively removing memory copies. The results we present demonstrate gains in terms of memory and computation time on Xeon Phi (compared to OpenMP-only and MPI-only versions) using a representative domain decomposition benchmark.
{"title":"An MPI Halo-Cell Implementation for Zero-Copy Abstraction","authors":"Jean-Baptiste Besnard, A. Malony, S. Shende, Marc Pérache, Patrick Carribault, Julien Jaeger","doi":"10.1145/2802658.2802669","DOIUrl":"https://doi.org/10.1145/2802658.2802669","url":null,"abstract":"In the race for Exascale, the advent of many-core processors will bring a shift in parallel computing architectures to systems of much higher concurrency, but with a relatively smaller memory per thread. This shift raises concerns for the adaptability of HPC software, for the current generation to the brave new world. In this paper, we study domain splitting on an increasing number of memory areas as an example problem where negative performance impact on computation could arise. We identify the specific parameters that drive scalability for this problem, and then model the halo-cell ratio on common mesh topologies to study the memory and communication implications. Such analysis argues for the use of shared-memory parallelism, such as with OpenMP, to address the performance problems that could occur. In contrast, we propose an original solution based entirely on MPI programming semantics, while providing the performance advantages of hybrid parallel programming. Our solution transparently replaces halo-cells transfers with pointer exchanges when MPI tasks are running on the same node, effectively removing memory copies. The results we present demonstrate gains in terms of memory and computation time on Xeon Phi (compared to OpenMP-only and MPI-only) using a representative domain decomposition benchmark.","PeriodicalId":365272,"journal":{"name":"Proceedings of the 22nd European MPI Users' Group Meeting","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133130107","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proceedings of the 22nd European MPI Users' Group Meeting","authors":"","doi":"10.1145/2802658","DOIUrl":"https://doi.org/10.1145/2802658","url":null,"abstract":"","PeriodicalId":365272,"journal":{"name":"Proceedings of the 22nd European MPI Users' Group Meeting","volume":"304 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122775196","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}