2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS): Latest Papers, Page 6

Overlapping Communications with Other Communications and Its Application to Distributed Dense Matrix Computations

2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Pub Date : 2019-05-01 DOI: 10.1109/IPDPS.2019.00060

Hua Huang, Edmond Chow

This paper presents the idea of overlapping communications with communications. Communication operations are overlapped, allowing actual data transfer in one operation to be overlapped with synchronization or other overheads in another operation, thus making more effective use of the available network bandwidth. We use two techniques for overlapping communication operations: a novel technique called "nonblocking overlap" that uses MPI-3 nonblocking collective operations and software pipelines, and a simpler technique that uses multiple MPI processes per node to send different portions of data simultaneously. The idea is applied to the parallel dense matrix squaring and cubing kernel in density matrix purification, an important kernel in electronic structure calculations. The kernel is up to 91.2% faster when communication operations are overlapped.
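
To make the overlap idea concrete, here is a minimal timing model, not the authors' implementation. It assumes every communication operation consists of a fixed synchronization overhead followed by a data transfer, and that a two-stage software pipeline lets the overhead of the next operation run while the current transfer is in flight. The function names and the numbers in the usage note are illustrative assumptions.

```python
def serial_time(n_ops, overhead, transfer):
    """Total time when each operation starts only after the previous one ends."""
    return n_ops * (overhead + transfer)


def overlapped_time(n_ops, overhead, transfer):
    """Total time for a two-stage pipeline (overhead stage, then transfer stage).

    The first overhead and the last transfer cannot be hidden; in between,
    the slower of the two stages determines the pace.
    """
    return overhead + transfer + (n_ops - 1) * max(overhead, transfer)


def speedup(n_ops, overhead, transfer):
    """Ratio of serial time to overlapped time under this toy model."""
    return serial_time(n_ops, overhead, transfer) / overlapped_time(n_ops, overhead, transfer)
```

With 8 operations, an overhead of 1.0 and a transfer time of 3.0 (arbitrary units), the model gives 32.0 for the serial schedule and 25.0 for the overlapped one. Real gains depend on how much of each operation is overhead rather than transfer.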

Citations: 1

Containers in HPC: A Scalability and Portability Study in Production Biological Simulations

2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Pub Date : 2019-05-01 DOI: 10.1109/IPDPS.2019.00066

Oleksandr Rudyy, M. Garcia-Gasulla, F. Mantovani, A. Santiago, R. Sirvent, M. Vázquez

Since the appearance of Docker in 2013, container technologies for computers have evolved and gained importance in cloud data centers. However, adoption of containers in High-Performance Computing (HPC) centers is still under discussion: on one hand, the ease of portability is very well accepted; on the other hand, the performance penalties and security issues introduced by the added software layers are often under scrutiny. Since very little evaluation of large production HPC codes running in containers is available, we provide in this paper a comparative study using a production simulation of a biological system. The simulation is performed using Alya, which is a computational fluid dynamics (CFD) code optimized for HPC environments and able to run multiphysics problems. In the paper, we analyze the productivity advantages of adopting containers for large HPC codes, and we quantify the performance overhead induced by the use of three different container technologies (Docker, Singularity, and Shifter), comparing it to native execution. Given the results of these tests, we selected Singularity as the best technology, based on performance and portability. We show scalability results of Alya using Singularity on up to 256 computational nodes (up to 12k cores) of MareNostrum4 and present a study of performance and portability on three different HPC architectures (Intel Skylake, IBM Power9, and Arm-v8).

Citations: 23

Distributed Approximate k-Core Decomposition and Min-Max Edge Orientation: Breaking the Diameter Barrier

2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Pub Date : 2019-05-01 DOI: 10.1109/IPDPS.2019.00044

T-H. Hubert Chan, Mauro Sozio, Bintao Sun

We design distributed algorithms to compute approximate solutions for several related graph optimization problems. All our algorithms have round complexity that is logarithmic in the number of nodes of the underlying graph and, in particular, independent of the graph diameter. By using a primal-dual approach, we develop a 2(1+ε)-approximation algorithm for computing the coreness values of the nodes in the underlying graph, as well as a 2(1+ε)-approximation algorithm for the min-max edge orientation problem, where the goal is to orient the edges so as to minimize the maximum weighted in-degree. We provide lower bounds showing that the aforementioned algorithms are tight both in terms of the approximation guarantee and the round complexity. Finally, motivated by the fact that the densest subset problem has an inherent dependency on the diameter of the graph, we study a weaker version that does not suffer from the same limitation.
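
For readers unfamiliar with coreness, the sketch below computes exact coreness values with the classical sequential peeling procedure. This is not the distributed primal-dual algorithm of the paper; it only shows the quantity that the 2(1+ε)-approximation estimates. The adjacency-dictionary input format and the function name are assumptions made for illustration.

```python
def coreness(adj):
    """Exact coreness of every node of an undirected, unweighted graph.

    adj maps each node to the set of its neighbours. The procedure repeatedly
    removes a node of minimum remaining degree; the largest minimum degree
    seen so far is the coreness of the node being removed.
    """
    degree = {v: len(neighbours) for v, neighbours in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Linear scan for the minimum degree: simple, not asymptotically optimal.
        v = min(remaining, key=degree.get)
        k = max(k, degree[v])
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                degree[u] -= 1
    return core
```

On a triangle `a, b, c` with a pendant node `d` attached to `a`, the three triangle nodes have coreness 2 and `d` has coreness 1.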

Citations: 5

Peace Through Superior Puzzling: An Asymmetric Sybil Defense

2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Pub Date : 2019-05-01 DOI: 10.1109/IPDPS.2019.00115

Diksha Gupta, Jared Saia, Maxwell Young

A common tool to defend against Sybil attacks is proof-of-work, whereby computational puzzles are used to limit the number of Sybil participants. Unfortunately, current Sybil defenses require significant computational effort to offset an attack. In particular, good participants must spend computationally at a rate that is proportional to the spending rate of an attacker. In this paper, we present the first Sybil defense algorithm which is asymmetric in the sense that good participants spend at a rate that is asymptotically less than an attacker. In particular, if T is the rate of the attacker's spending, and J is the rate of joining good participants, then our algorithm spends at a rate of O(sqrt(TJ) + J). We provide empirical evidence that our algorithm can be significantly more efficient than previous defenses under various attack scenarios. Additionally, we prove a lower bound showing that our algorithm's spending rate is asymptotically optimal among a large family of algorithms.
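
The gap between symmetric and asymmetric spending can be illustrated with a few lines of arithmetic. The sketch below drops all constant factors hidden by the O-notation, and the symmetric baseline `T + J` is an assumption standing in for "a rate proportional to the attacker's spending rate"; the function names are hypothetical.

```python
import math


def symmetric_rate(T, J):
    """Baseline: good participants spend in proportion to the attacker (constants dropped)."""
    return T + J


def asymmetric_rate(T, J):
    """The sqrt(T*J) + J rate from the abstract (constants dropped)."""
    return math.sqrt(T * J) + J
```

For an attacker spending rate of T = 10000 and a join rate of J = 100, the symmetric baseline is 10100 while the asymmetric rate is 1100.0, roughly a ninefold reduction under these simplifying assumptions.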

Citations: 15

Efficient Architecture-Aware Acceleration of BWA-MEM for Multicore Systems

2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Pub Date : 2019-05-01 DOI: 10.1109/IPDPS.2019.00041

Md. Vasimuddin, Sanchit Misra, Heng Li, S. Aluru

Innovations in Next-Generation Sequencing are enabling generation of DNA sequence data at ever faster rates and at very low cost. For example, the Illumina NovaSeq 6000 sequencer can generate 6 Terabases of data in less than two days, sequencing nearly 20 billion short DNA fragments called reads at the low cost of $1000 per human genome. Large sequencing centers typically employ hundreds of such systems. Such high-throughput and low-cost generation of data underscores the need for commensurate acceleration in downstream computational analysis of the sequencing data. A fundamental step in downstream analysis is mapping of the reads to a long reference DNA sequence, such as a reference human genome. Sequence mapping is a compute-intensive step that accounts for more than 30% of the overall time of the GATK (Genome Analysis ToolKit) best practices workflow. BWA-MEM is one of the most widely used tools for sequence mapping and has tens of thousands of users. In this work, we focus on accelerating BWA-MEM through an efficient architecture-aware implementation, while maintaining identical output. The volume of data requires distributed computing and is usually processed on clusters or cloud deployments, with multicore processors usually being the platform of choice. Since the application can be easily parallelized across multiple sockets (even across distributed memory systems) by simply distributing the reads equally, we focus on performance improvements on a single-socket multicore processor. BWA-MEM run time is dominated by three kernels, collectively responsible for more than 85% of the overall compute time. We improved the performance of the three kernels by 1) using techniques to improve cache reuse, 2) simplifying the algorithms, 3) replacing many small memory allocations with a few large contiguous ones to improve hardware prefetching of data, 4) software prefetching of data, and 5) utilization of SIMD wherever applicable and massive reorganization of the source code to enable these improvements. As a result, we achieved nearly 2×, 183×, and 8× speedups on the three kernels, respectively, resulting in up to 3.5× and 2.4× speedups on end-to-end compute time over the original BWA-MEM on single thread and single socket of Intel Xeon Skylake processor. To the best of our knowledge, this is the highest reported speedup over BWA-MEM (running on a single CPU) while using a single CPU or a single CPU-single GPGPU/FPGA combination. Source-code: https://github.com/bwa-mem2/bwa-mem2

Citations: 525

VeloC: Towards High Performance Adaptive Asynchronous Checkpointing at Large Scale

2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Pub Date : 2019-05-01 DOI: 10.1109/IPDPS.2019.00099

Bogdan Nicolae, A. Moody, Elsa Gonsiorowski, K. Mohror, F. Cappello

Global checkpointing to external storage (e.g., a parallel file system) is a common I/O pattern of many HPC applications. However, given the limited I/O throughput of external storage, global checkpointing can often lead to I/O bottlenecks. To address this issue, a shift from synchronous checkpointing (i.e., blocking until writes have finished) to asynchronous checkpointing (i.e., writing to faster local storage and flushing to external storage in the background) is increasingly being adopted. However, with rising core count per node and heterogeneity of both local and external storage, it is nontrivial to design efficient asynchronous checkpointing mechanisms due to the complex interplay between high concurrency and I/O performance variability at both the node-local and global levels. This problem is not well understood but highly important for modern supercomputing infrastructures. This paper proposes a versatile asynchronous checkpointing solution that addresses this problem. To this end, we introduce a concurrency-optimized technique that combines performance modeling with lightweight monitoring to make informed decisions about which local storage devices to use in order to dynamically adapt to background flushes and reduce the checkpointing overhead. We illustrate this technique using the VeloC prototype. Extensive experiments on a pre-Exascale supercomputing system show significant benefits.
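
The basic asynchronous pattern described here (write locally, flush in the background) can be sketched in a few lines. This is a single-process toy and not the VeloC API: the function name `checkpoint_async`, its parameters, and the use of plain directories for "local" and "external" storage are assumptions for illustration, and none of the paper's adaptive device selection is modelled.

```python
import os
import shutil
import tempfile
import threading


def checkpoint_async(data, name, local_dir, external_dir):
    """Write data to local storage, then flush it to external storage in the background.

    Returns the flush thread so the caller can wait for it before relying on
    the external copy.
    """
    local_path = os.path.join(local_dir, name)
    with open(local_path, "wb") as f:
        # Blocking part: a write to (assumed fast) node-local storage.
        f.write(data)

    def flush():
        # Background part: copy the local checkpoint to (assumed slow) external storage.
        shutil.copyfile(local_path, os.path.join(external_dir, name))

    thread = threading.Thread(target=flush)
    thread.start()
    return thread
```

A caller would invoke `checkpoint_async(state_bytes, "ckpt_42.bin", local_dir, external_dir)`, continue computing, and call `join()` on the returned thread only when the external copy must be complete.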

Citations: 49

MOARD: Modeling Application Resilience to Transient Faults on Data Objects

2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Pub Date : 2019-05-01 DOI: 10.1109/IPDPS.2019.00096

Luanzheng Guo, Dong Li

Understanding application resilience (or error tolerance) in the presence of hardware transient faults on data objects is critical to ensure computing integrity and enable efficient application-level fault tolerance mechanisms. However, we lack a method and a tool to quantify application resilience to transient faults on data objects. The traditional method, random fault injection, cannot help, because it loses data semantics and provides insufficient information on how and where errors are tolerated. In this paper, we introduce a method and a tool (called "MOARD") to model and quantify application resilience to transient faults on data objects. Our method is based on systematically quantifying error masking events caused by application-inherent semantics and program constructs. We use MOARD to study how and why errors in data objects can be tolerated by the application. We demonstrate tangible benefits of using MOARD to direct a fault tolerance mechanism to protect data objects.
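
An error masking event is easiest to see on a toy kernel. The sketch below is not MOARD; it flips one bit in one element of a data object and checks whether the output changes. The kernel `low_byte_checksum` is a made-up example in which only the low 8 bits of each value matter, so flips in higher bits are masked by the program's own semantics.

```python
def flip_bit(value, bit):
    """Return value with the given bit position inverted (a single-bit transient fault)."""
    return value ^ (1 << bit)


def low_byte_checksum(values):
    """Toy kernel: only the low 8 bits of each input value influence the result."""
    return sum(v & 0xFF for v in values) & 0xFF


def is_masked(values, index, bit):
    """True if flipping one bit of values[index] leaves the kernel output unchanged."""
    faulty = list(values)
    faulty[index] = flip_bit(faulty[index], bit)
    return low_byte_checksum(faulty) == low_byte_checksum(values)


def masked_fraction(values, index, n_bits):
    """Fraction of single-bit flips in values[index], over bits 0..n_bits-1, that are masked."""
    return sum(is_masked(values, index, b) for b in range(n_bits)) / n_bits
```

For `values = [3, 260, 17]`, a flip of bit 12 in the second element is masked, a flip of bit 0 is not, and over 16 bit positions exactly half of the flips are masked.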

Citations: 10

On Optimizing Complex Stencils on GPUs

2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Pub Date : 2019-05-01 DOI: 10.1109/IPDPS.2019.00073

P. Rawat, Miheer Vaidya, Aravind Sukumaran-Rajam, A. Rountev, L. Pouchet, P. Sadayappan

Stencil computations are often the compute-intensive kernel in many scientific applications. With the increasing demand for computational accuracy, and the emergence of massively data-parallel high-bandwidth architectures like GPUs, stencils have steadily become more complex in terms of the stencil order, data accesses, and reuse patterns. Many prior efforts have focused on optimizing simpler stencil computations on various platforms. However, existing stencil code generators face challenges in optimizing such complex multi-statement stencil DAGs. This paper addresses the challenges in optimizing high-order stencil DAGs on GPUs by focusing on two key considerations: (1) enabling the domain expert to guide the code optimization, which may otherwise be extremely challenging for complex stencils; and (2) using bottleneck analysis via runtime profiling to guide the application of optimizations, and the tuning of various code generation parameters. We implement these abstractions in a prototype code generation framework termed Artemis, and evaluate its efficacy over multiple stencil kernels with varying complexity and operational intensity on an NVIDIA P100 GPU.
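
As a reminder of what "stencil order" means for data accesses, here is a minimal one-dimensional stencil in plain Python. It is unrelated to the Artemis code generator; the function name and the integer coefficient sets are chosen for illustration. A higher-order stencil uses a wider radius, so each output point reads more neighbouring inputs.

```python
def apply_stencil(u, coeffs):
    """Apply a 1D stencil with an odd number of coefficients to the interior of u.

    The output has len(u) - 2 * radius entries, where radius = len(coeffs) // 2.
    """
    if len(coeffs) % 2 == 0:
        raise ValueError("stencil needs an odd number of coefficients")
    r = len(coeffs) // 2
    return [
        sum(c * u[i + k - r] for k, c in enumerate(coeffs))
        for i in range(r, len(u) - r)
    ]


# Second-order second-difference stencil (radius 1, 3 reads per point).
SECOND_ORDER = [1, -2, 1]
# Fourth-order second-difference stencil scaled by 12 (radius 2, 5 reads per point).
FOURTH_ORDER_X12 = [-1, 16, -30, 16, -1]
```

Applied to the squares `[0, 1, 4, 9, 16, 25]`, the radius-1 stencil yields `[2, 2, 2, 2]` and the radius-2 stencil yields `[24, 24]`, that is 12 times the exact second derivative of x squared.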

Citations: 24

UPC++: A High-Performance Communication Framework for Asynchronous Computation

2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Pub Date : 2019-05-01 DOI: 10.1109/IPDPS.2019.00104

J. Bachan, S. Baden, S. Hofmeyr, M. Jacquelin, A. Kamil, D. Bonachea, Paul H. Hargrove, H. Ahmed

UPC++ is a C++ library that supports high-performance computation via an asynchronous communication framework. This paper describes a new incarnation that differs substantially from its predecessor, and we discuss the reasons for our design decisions. We present new design features, including future-based asynchrony management, distributed objects, and generalized Remote Procedure Call (RPC). We show microbenchmark performance results demonstrating that one-sided Remote Memory Access (RMA) in UPC++ is competitive with MPI-3 RMA; on a Cray XC40 UPC++ delivers up to a 25% improvement in the latency of blocking RMA put, and up to a 33% bandwidth improvement in an RMA throughput test. We showcase the benefits of UPC++ with irregular applications through a pair of application motifs, a distributed hash table and a sparse solver component. Our distributed hash table in UPC++ delivers near-linear weak scaling up to 34816 cores of a Cray XC40. Our UPC++ implementation of the sparse solver component shows robust strong scaling up to 2048 cores, where it outperforms variants communicating using MPI by up to 3.1x. UPC++ encourages the use of aggressive asynchrony in low-overhead RMA and RPC, improving programmer productivity and delivering high performance in irregular applications.
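
The combination of futures, RPC, and a distributed hash table can be mimicked in a single Python process. The sketch below is only an analogy and uses none of the UPC++ API: each "rank" is a dictionary served by its own single-thread executor, an operation is shipped to the owning rank, and the caller receives a future. The class `ToyDistributedMap` and its methods are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


class ToyDistributedMap:
    """Hash table partitioned over n_ranks shards, each served by one worker thread."""

    def __init__(self, n_ranks):
        self.shards = [dict() for _ in range(n_ranks)]
        self.workers = [ThreadPoolExecutor(max_workers=1) for _ in range(n_ranks)]

    def _owner(self, key):
        # The owning rank is chosen by hashing the key.
        return hash(key) % len(self.shards)

    def insert(self, key, value):
        """Ship the insertion to the owning rank; returns a future that completes when it is done."""
        r = self._owner(key)
        return self.workers[r].submit(self.shards[r].__setitem__, key, value)

    def find(self, key):
        """Ship the lookup to the owning rank; returns a future holding the value or None."""
        r = self._owner(key)
        return self.workers[r].submit(self.shards[r].get, key)

    def close(self):
        for w in self.workers:
            w.shutdown()
```

Because each shard has exactly one worker, requests to the same shard run in submission order, so `m.insert("x", 1)` followed by `m.find("x").result()` returns 1 without an explicit wait on the insertion.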

Citations: 43

Practically Efficient Scheduler for Minimizing Average Flow Time of Parallel Jobs

2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Pub Date : 2019-05-01 DOI: 10.1109/IPDPS.2019.00024

Kunal Agrawal, I. Lee, Jing Li, Kefu Lu, Benjamin Moseley

Many algorithms have been proposed to efficiently schedule parallel jobs on a multicore and/or multiprocessor machine to minimize average flow time, and the complexity of the problem is well understood. In practice, the problem is far from being understood. A reason for the gap between theory and practice is that all theoretical algorithms have prohibitive overheads in actual implementation, including the use of many preemptions. One of the flagship successes of scheduling theory is the work-stealing scheduler. Work-stealing is used for optimizing the flow time of a single parallel job executing on a single machine with multiple cores and has strong performance in theory and in practice. Consequently, it is implemented in almost all parallel runtime systems. This paper seeks to bridge theory and practice for scheduling parallel jobs that arrive online, by introducing an adaptation of the work-stealing scheduler for average flow time. The new algorithm, Distributed Random Equi-Partition (DREP), has strong practical and theoretical performance. Practically, the algorithm has the following advantages: (1) it is non-clairvoyant; (2) all processors make scheduling decisions in a decentralized manner requiring minimal synchronization and communications; and (3) it requires a small and bounded number of preemptions. Theoretically, we prove that DREP is (4+ε)-speed O(1/ε^3)-competitive for average flow time. We have empirically evaluated DREP using both simulations and actual implementation by modifying the Cilk Plus work-stealing runtime system. The evaluation results show that DREP performs well compared to other scheduling strategies, including those that are theoretically good but cannot be faithfully implemented in practice.
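
The objective itself, average flow time, is the mean of completion time minus arrival time over all jobs. The sketch below is not DREP; it shows on a single processor, with all jobs arriving at time 0, how the processing order alone changes the metric. The function names and the job sizes in the usage note are illustrative assumptions.

```python
def completion_times(work_in_order):
    """Completion time of each job when jobs run back to back from time 0 in the given order."""
    finished = []
    clock = 0
    for work in work_in_order:
        clock += work
        finished.append(clock)
    return finished


def average_flow_time(arrivals, completions):
    """Mean of (completion - arrival) over all jobs."""
    flows = [done - arrived for arrived, done in zip(arrivals, completions)]
    return sum(flows) / len(flows)
```

For three jobs of size 8, 1, and 1 that all arrive at time 0, running them in arrival order gives completions 8, 9, 10 and an average flow time of 9.0, while running the short jobs first gives completions 1, 2, 10 and an average of about 4.33. A non-clairvoyant scheduler does not know job sizes in advance and cannot simply sort them.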

Citations: 3