Pub Date: 2019-05-01 DOI: 10.1109/IPDPS.2019.00060
Hua Huang, Edmond Chow
This paper presents the idea of overlapping communications with communications. Communication operations are overlapped, allowing actual data transfer in one operation to be overlapped with synchronization or other overheads in another operation, thus making more effective use of the available network bandwidth. We use two techniques for overlapping communication operations: a novel technique called "nonblocking overlap" that uses MPI-3 nonblocking collective operations and software pipelines, and a simpler technique that uses multiple MPI processes per node to send different portions of data simultaneously. The idea is applied to the parallel dense matrix squaring and cubing kernel in density matrix purification, an important kernel in electronic structure calculations. The kernel is up to 91.2% faster when communication operations are overlapped.
Title: Overlapping Communications with Other Communications and Its Application to Distributed Dense Matrix Computations. Venue: 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
Pub Date: 2019-05-01 DOI: 10.1109/IPDPS.2019.00066
Oleksandr Rudyy, M. Garcia-Gasulla, F. Mantovani, A. Santiago, R. Sirvent, M. Vázquez
Since the appearance of Docker in 2013, container technologies for computers have evolved and gained importance in cloud data centers. However, adoption of containers in High-Performance Computing (HPC) centers is still under discussion: on one hand, the ease of portability is very well accepted; on the other hand, the performance penalties and security issues introduced by the added software layers are often under scrutiny. Since very little evaluation of large production HPC codes running in containers is available, we provide in this paper a comparative study using a production simulation of a biological system. The simulation is performed using Alya, which is a computational fluid dynamics (CFD) code optimized for HPC environments and able to run multiphysics problems. In the paper, we analyze the productivity advantages of adopting containers for large HPC codes, and we quantify the performance overhead induced by the use of three different container technologies (Docker, Singularity and Shifter), comparing it to native execution. Given the results of these tests, we selected Singularity as the best technology, based on performance and portability. We show scalability results of Alya using Singularity up to 256 computational nodes (up to 12k cores) of MareNostrum4 and present a study of performance and portability on three different HPC architectures (Intel Skylake, IBM Power9, and Arm-v8).
Title: Containers in HPC: A Scalability and Portability Study in Production Biological Simulations. Venue: 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
Pub Date: 2019-05-01 DOI: 10.1109/IPDPS.2019.00044
T-H. Hubert Chan, Mauro Sozio, Bintao Sun
We design distributed algorithms to compute approximate solutions for several related graph optimization problems. All our algorithms have round complexity that is logarithmic in the number of nodes of the underlying graph and, in particular, independent of the graph diameter. By using a primal-dual approach, we develop a 2(1+ε)-approximation algorithm for computing the coreness values of the nodes in the underlying graph, as well as a 2(1+ε)-approximation algorithm for the min-max edge orientation problem, where the goal is to orient the edges so as to minimize the maximum weighted in-degree. We provide lower bounds showing that the aforementioned algorithms are tight both in terms of the approximation guarantee and the round complexity. Finally, motivated by the fact that the densest subset problem has an inherent dependency on the diameter of the graph, we study a weaker version that does not suffer from the same limitation.
Title: Distributed Approximate k-Core Decomposition and Min-Max Edge Orientation: Breaking the Diameter Barrier. Venue: 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
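For readers unfamiliar with the term, the coreness of a node is the largest k such that the node belongs to a subgraph in which every node has degree at least k. The sketch below computes exact coreness values with the classic sequential peeling procedure on an unweighted graph. It is only an illustration of the quantity being approximated, not the distributed primal-dual algorithm of the paper; the function name `coreness` and the adjacency-dictionary input format are assumptions made for this example.

```python
def coreness(adj):
    """Exact coreness by sequential peeling (illustrative, not the paper's method).

    adj: dict mapping each node to the set of its neighbours
         (undirected simple graph; hypothetical input format).
    """
    degree = {v: len(neighbours) for v, neighbours in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Repeatedly remove a node of minimum remaining degree.
        v = min(remaining, key=degree.get)
        # The coreness never decreases along the peeling order.
        k = max(k, degree[v])
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                degree[u] -= 1
    return core


if __name__ == "__main__":
    # A triangle a-b-c with a pendant node d attached to a.
    graph = {
        "a": {"b", "c", "d"},
        "b": {"a", "c"},
        "c": {"a", "b"},
        "d": {"a"},
    }
    print(coreness(graph))
```

In the example graph the triangle nodes have coreness 2 and the pendant node has coreness 1. The peeling order is inherently sequential, which is one reason a distributed algorithm with round complexity independent of the diameter settles for an approximation.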
Pub Date: 2019-05-01 DOI: 10.1109/IPDPS.2019.00115
Diksha Gupta, Jared Saia, Maxwell Young
A common tool to defend against Sybil attacks is proof-of-work, whereby computational puzzles are used to limit the number of Sybil participants. Unfortunately, current Sybil defenses require significant computational effort to offset an attack. In particular, good participants must spend computationally at a rate that is proportional to the spending rate of an attacker. In this paper, we present the first Sybil defense algorithm which is asymmetric in the sense that good participants spend at a rate that is asymptotically less than an attacker. In particular, if T is the rate of the attacker's spending, and J is the rate of joining good participants, then our algorithm spends at a rate of O(sqrt(TJ) + J). We provide empirical evidence that our algorithm can be significantly more efficient than previous defenses under various attack scenarios. Additionally, we prove a lower bound showing that our algorithm's spending rate is asymptotically optimal among a large family of algorithms.
Title: Peace Through Superior Puzzling: An Asymmetric Sybil Defense. Venue: 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
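To get a feel for the asymmetry claimed in the abstract, the following sketch compares a spending rate of the form sqrt(TJ) + J with a rate proportional to the attacker's spending, T + J. The constants hidden by the O-notation are set to 1, and the linear baseline is an assumed stand-in for "proportional to the spending rate of an attacker"; both function names are hypothetical and neither is taken from the paper.

```python
import math


def symmetric_rate(T, J):
    # Assumed baseline: good participants spend in proportion to the
    # attacker's rate T, plus the join rate J (constant factors set to 1).
    return T + J


def asymmetric_rate(T, J):
    # Shape of the bound stated in the abstract, O(sqrt(T*J) + J),
    # again with the hidden constant set to 1 for illustration.
    return math.sqrt(T * J) + J


if __name__ == "__main__":
    T, J = 10_000, 100
    print(symmetric_rate(T, J))   # 10100
    print(asymmetric_rate(T, J))  # 1100.0
```

With T = 10,000 and J = 100 the illustrative rates are 10,100 and 1,100. The gap widens as T grows relative to J, because the defenders' cost scales with the square root of the attacker's spending rather than linearly.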
Pub Date: 2019-05-01 DOI: 10.1109/IPDPS.2019.00041
Md. Vasimuddin, Sanchit Misra, Heng Li, S. Aluru
Innovations in Next-Generation Sequencing are enabling generation of DNA sequence data at ever faster rates and at very low cost. For example, the Illumina NovaSeq 6000 sequencer can generate 6 terabases of data in less than two days, sequencing nearly 20 billion short DNA fragments called reads at the low cost of $1000 per human genome. Large sequencing centers typically employ hundreds of such systems. Such high-throughput and low-cost generation of data underscores the need for commensurate acceleration in downstream computational analysis of the sequencing data. A fundamental step in downstream analysis is mapping of the reads to a long reference DNA sequence, such as a reference human genome. Sequence mapping is a compute-intensive step that accounts for more than 30% of the overall time of the GATK (Genome Analysis ToolKit) best practices workflow. BWA-MEM is one of the most widely used tools for sequence mapping and has tens of thousands of users. In this work, we focus on accelerating BWA-MEM through an efficient architecture-aware implementation, while maintaining identical output. The volume of data requires distributed computing and is usually processed on clusters or cloud deployments, with multicore processors usually being the platform of choice. Since the application can be easily parallelized across multiple sockets (even across distributed memory systems) by simply distributing the reads equally, we focus on performance improvements on a single-socket multicore processor. BWA-MEM run time is dominated by three kernels, collectively responsible for more than 85% of the overall compute time. 
We improved the performance of the three kernels by 1) using techniques to improve cache reuse, 2) simplifying the algorithms, 3) replacing many small memory allocations with a few large contiguous ones to improve hardware prefetching of data, 4) software prefetching of data, and 5) utilization of SIMD wherever applicable and massive reorganization of the source code to enable these improvements. As a result, we achieved nearly 2×, 183×, and 8× speedups on the three kernels, respectively, resulting in up to 3.5× and 2.4× speedups on end-to-end compute time over the original BWA-MEM on a single thread and a single socket of an Intel Xeon Skylake processor. To the best of our knowledge, this is the highest reported speedup over BWA-MEM (running on a single CPU) while using a single CPU or a single CPU-single GPGPU/FPGA combination. Source-code: https://github.com/bwa-mem2/bwa-mem2
Title: Efficient Architecture-Aware Acceleration of BWA-MEM for Multicore Systems. Venue: 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
Pub Date: 2019-05-01 DOI: 10.1109/IPDPS.2019.00099
Bogdan Nicolae, A. Moody, Elsa Gonsiorowski, K. Mohror, F. Cappello
Global checkpointing to external storage (e.g., a parallel file system) is a common I/O pattern of many HPC applications. However, given the limited I/O throughput of external storage, global checkpointing can often lead to I/O bottlenecks. To address this issue, a shift from synchronous checkpointing (i.e., blocking until writes have finished) to asynchronous checkpointing (i.e., writing to faster local storage and flushing to external storage in the background) is increasingly being adopted. However, with rising core count per node and heterogeneity of both local and external storage, it is nontrivial to design efficient asynchronous checkpointing mechanisms due to the complex interplay between high concurrency and I/O performance variability at both the node-local and global levels. This problem is not well understood but highly important for modern supercomputing infrastructures. This paper proposes a versatile asynchronous checkpointing solution that addresses this problem. To this end, we introduce a concurrency-optimized technique that combines performance modeling with lightweight monitoring to make informed decisions about which local storage devices to use in order to dynamically adapt to background flushes and reduce the checkpointing overhead. We illustrate this technique using the VeloC prototype. Extensive experiments on a pre-Exascale supercomputing system show significant benefits.
Title: VeloC: Towards High Performance Adaptive Asynchronous Checkpointing at Large Scale. Venue: 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
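The basic asynchronous pattern described in the abstract (write to fast local storage, return to the application, flush to external storage in the background) can be sketched in a few lines. This is a minimal single-process illustration under stated assumptions: the class `AsyncCheckpointer` and its methods are hypothetical names, two ordinary directories stand in for node-local and external storage, and none of this reflects the VeloC API or its adaptive device selection.

```python
import os
import queue
import shutil
import tempfile
import threading


class AsyncCheckpointer:
    """Hypothetical sketch: blocking local write, background flush."""

    def __init__(self, local_dir, external_dir):
        self.local_dir = local_dir
        self.external_dir = external_dir
        self._pending = queue.Queue()
        self._worker = threading.Thread(target=self._flush_loop, daemon=True)
        self._worker.start()

    def checkpoint(self, name, data):
        # The application blocks only for the (fast) local write.
        path = os.path.join(self.local_dir, name)
        with open(path, "wb") as f:
            f.write(data)
        self._pending.put(name)
        return path

    def _flush_loop(self):
        # Background thread: copy each local checkpoint to external storage.
        while True:
            name = self._pending.get()
            try:
                if name is None:  # shutdown sentinel
                    return
                shutil.copyfile(
                    os.path.join(self.local_dir, name),
                    os.path.join(self.external_dir, name),
                )
            finally:
                self._pending.task_done()

    def wait(self):
        # Block until every queued checkpoint has been flushed.
        self._pending.join()

    def close(self):
        self._pending.put(None)
        self._worker.join()


if __name__ == "__main__":
    local = tempfile.mkdtemp()
    external = tempfile.mkdtemp()
    ckpt = AsyncCheckpointer(local, external)
    ckpt.checkpoint("step0.bin", b"application state")
    ckpt.wait()
    print(os.listdir(external))
    ckpt.close()
```

A single flush thread and a single local directory keep the sketch short. The difficulty the abstract points to arises when many processes per node share several local devices of differing speed while flushes run concurrently, which this sketch does not model.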
Pub Date: 2019-05-01 DOI: 10.1109/IPDPS.2019.00096
Luanzheng Guo, Dong Li
Understanding application resilience (or error tolerance) in the presence of hardware transient faults on data objects is critical to ensure computing integrity and enable efficient application-level fault tolerance mechanisms. However, we lack a method and a tool to quantify application resilience to transient faults on data objects. The traditional method, random fault injection, cannot help, because it loses data semantics and provides insufficient information on how and where errors are tolerated. In this paper, we introduce a method and a tool (called "MOARD") to model and quantify application resilience to transient faults on data objects. Our method is based on systematically quantifying error masking events caused by application-inherent semantics and program constructs. We use MOARD to study how and why errors in data objects can be tolerated by the application. We demonstrate tangible benefits of using MOARD to direct a fault tolerance mechanism to protect data objects.
Title: MOARD: Modeling Application Resilience to Transient Faults on Data Objects. Venue: 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
Pub Date: 2019-05-01 DOI: 10.1109/IPDPS.2019.00073
P. Rawat, Miheer Vaidya, Aravind Sukumaran-Rajam, A. Rountev, L. Pouchet, P. Sadayappan
Stencil computations are often the compute-intensive kernel in many scientific applications. With the increasing demand for computational accuracy, and the emergence of massively data-parallel high-bandwidth architectures like GPUs, stencils have steadily become more complex in terms of the stencil order, data accesses, and reuse patterns. Many prior efforts have focused on optimizing simpler stencil computations on various platforms. However, existing stencil code generators face challenges in optimizing such complex multi-statement stencil DAGs. This paper addresses the challenges in optimizing high-order stencil DAGs on GPUs by focusing on two key considerations: (1) enabling the domain expert to guide the code optimization, which may otherwise be extremely challenging for complex stencils; and (2) using bottleneck analysis via runtime profiling to guide the application of optimizations, and the tuning of various code generation parameters. We implement these abstractions in a prototype code generation framework termed Artemis, and evaluate its efficacy over multiple stencil kernels with varying complexity and operational intensity on an NVIDIA P100 GPU.
Title: On Optimizing Complex Stencils on GPUs. Venue: 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
Pub Date: 2019-05-01 DOI: 10.1109/IPDPS.2019.00104
J. Bachan, S. Baden, S. Hofmeyr, M. Jacquelin, A. Kamil, D. Bonachea, Paul H. Hargrove, H. Ahmed
UPC++ is a C++ library that supports high-performance computation via an asynchronous communication framework. This paper describes a new incarnation that differs substantially from its predecessor, and we discuss the reasons for our design decisions. We present new design features, including future-based asynchrony management, distributed objects, and generalized Remote Procedure Call (RPC). We show microbenchmark performance results demonstrating that one-sided Remote Memory Access (RMA) in UPC++ is competitive with MPI-3 RMA; on a Cray XC40 UPC++ delivers up to a 25% improvement in the latency of blocking RMA put, and up to a 33% bandwidth improvement in an RMA throughput test. We showcase the benefits of UPC++ with irregular applications through a pair of application motifs, a distributed hash table and a sparse solver component. Our distributed hash table in UPC++ delivers near-linear weak scaling up to 34816 cores of a Cray XC40. Our UPC++ implementation of the sparse solver component shows robust strong scaling up to 2048 cores, where it outperforms variants communicating using MPI by up to 3.1x. UPC++ encourages the use of aggressive asynchrony in low-overhead RMA and RPC, improving programmer productivity and delivering high performance in irregular applications.
Title: UPC++: A High-Performance Communication Framework for Asynchronous Computation. Venue: 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
Pub Date : 2019-05-01 DOI: 10.1109/IPDPS.2019.00024
Kunal Agrawal, I. Lee, Jing Li, Kefu Lu, Benjamin Moseley
Many algorithms have been proposed to efficiently schedule parallel jobs on a multicore and/or multiprocessor machine to minimize average flow time, and the complexity of the problem is well understood in theory. In practice, however, the problem is far from understood. A reason for the gap between theory and practice is that all theoretical algorithms have prohibitive overheads in actual implementations, including the use of many preemptions. One of the flagship successes of scheduling theory is the work-stealing scheduler. Work-stealing is used for optimizing the flow time of a single parallel job executing on a single machine with multiple cores and has strong performance in theory and in practice. Consequently, it is implemented in almost all parallel runtime systems. This paper seeks to bridge theory and practice for scheduling parallel jobs that arrive online, by introducing an adaptation of the work-stealing scheduler for average flow time. The new algorithm, Distributed Random Equi-Partition (DREP), has strong practical and theoretical performance. Practically, the algorithm has the following advantages: (1) it is non-clairvoyant; (2) all processors make scheduling decisions in a decentralized manner requiring minimal synchronization and communication; and (3) it requires a small and bounded number of preemptions. Theoretically, we prove that DREP is (4+ε)-speed O(1/ε^3)-competitive for average flow time. We have empirically evaluated DREP using both simulations and an actual implementation obtained by modifying the Cilk Plus work-stealing runtime system. The evaluation results show that DREP performs well compared to other scheduling strategies, including those that are theoretically good but cannot be faithfully implemented in practice.
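For readers unfamiliar with the terminology in this abstract, the standard definitions behind "average flow time" and "s-speed c-competitive" can be sketched as follows. The symbols (r_j, C_j, n, the bar notation, and the speed superscripts) are assumptions introduced here for illustration; the abstract itself does not define them.

```latex
% Assumed notation (not from the abstract):
%   r_j : release (arrival) time of job j
%   C_j : completion time of job j under a given schedule
%   n   : number of jobs
\[
  F_j = C_j - r_j, \qquad
  \overline{F} = \frac{1}{n} \sum_{j=1}^{n} \left( C_j - r_j \right)
\]
% Resource augmentation: an online algorithm A is s-speed c-competitive if,
% for every instance I,
\[
  \overline{F}_{A}^{(s)}(I) \;\le\; c \cdot \overline{F}_{\mathrm{OPT}}^{(1)}(I),
\]
% where the superscript is the processor speed given to each scheduler and
% OPT is the optimal offline schedule on unit-speed processors.
% The bound stated for DREP corresponds to
% s = 4 + \varepsilon and c = O(1/\varepsilon^{3}).
```

Read this way, the stated guarantee says that DREP, when run on processors (4+ε) times faster than those of the optimal offline scheduler, has average flow time within an O(1/ε^3) factor of the optimum.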
Title: Practically Efficient Scheduler for Minimizing Average Flow Time of Parallel Jobs
Journal: 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS). https://doi.org/10.1109/IPDPS.2019.00024