Portal: A High-Performance Language and Compiler for Parallel N-Body Problems
Pub Date: 2019-05-01 | DOI: 10.1109/IPDPS.2019.00106
Laleh Aghababaie Beni, S. Ramanan, Aparna Chandramowlishwaran
There is a big gap between the algorithm one designs on paper and the code that runs efficiently on a parallel system. Our goal is to combine the body of work in compilers, performance optimization, and the domain of N-body problems to build a system where domain scientists can write programs at a high level while attaining the performance of code written by experts at a low level. This paper presents Portal, a domain-specific language and compiler designed to enable high-performance implementations of N-body problems on modern multicore systems. Our goals in developing Portal are three-fold: (a) to implement scalable, fast algorithms with O(n log n) and O(n) complexity, (b) to design an intuitive language that enables rapid implementations of a variety of problems, and (c) to enable large-scale parallel problems to run on multicore systems. We target N-body problems in various domains, from machine learning to scientific computing, that can be expressed in Portal to obtain an out-of-the-box optimized parallel implementation. Experimental results on 6 N-body problems show that Portal is within 5% on average of expert hand-optimized C++ code on a dual-socket AMD EPYC processor. To our knowledge, no existing library or framework implements parallel asymptotically optimal algorithms for the class of generalized N-body problems, and Portal aims to fill this gap. Moreover, the Portal language and intermediate algorithm representation are portable and easily extensible to different platforms.
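For context, the kind of computation Portal abstracts away is the naive O(n^2) direct evaluation of pairwise interactions; the sketch below is illustrative C++ (not Portal syntax, and the `Body` struct and function name are made up here), showing the baseline that tree-based O(n log n)/O(n) algorithms replace.

```cpp
#include <cmath>
#include <vector>

struct Body { double x, y, z, mass; };

// Direct O(n^2) pairwise evaluation of a 1/r interaction kernel.
// Asymptotically optimal N-body codes avoid this quadratic cost by
// approximating far-field interactions; this is only the reference baseline.
std::vector<double> direct_potential(const std::vector<Body>& bodies) {
    std::vector<double> phi(bodies.size(), 0.0);
    for (std::size_t i = 0; i < bodies.size(); ++i) {
        for (std::size_t j = 0; j < bodies.size(); ++j) {
            if (i == j) continue;
            double dx = bodies[i].x - bodies[j].x;
            double dy = bodies[i].y - bodies[j].y;
            double dz = bodies[i].z - bodies[j].z;
            double r = std::sqrt(dx * dx + dy * dy + dz * dz);
            phi[i] += bodies[j].mass / r;  // accumulate potential at body i
        }
    }
    return phi;
}
```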
{"title":"Portal: A High-Performance Language and Compiler for Parallel N-Body Problems","authors":"Laleh Aghababaie Beni, S. Ramanan, Aparna Chandramowlishwaran","doi":"10.1109/IPDPS.2019.00106","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00106","url":null,"abstract":"There is a big gap between the algorithm one designs on paper and the code that runs efficiently on a parallel system. Our goal is to combine the body of work in compilers, performance optimization, and the domain of N-body problems to build a system where domain scientists can write programs at a high level while attaining performance of code written by experts at the low level. This paper presents Portal, a domain-specific language and compiler designed to enable high-performance implementations of N-body problems on modern multicore systems. Our goal in the development of Portal is three-fold, (a) to implement scalable, fast algorithms that have O(n log n) and O (n) complexity, (b) to design an intuitive language to enable rapid implementations of a variety of problems, and (c) to enable parallel large-scale problems to run on multicore systems. We target N-body problems in various domains from machine learning to scientific computing that can be expressed in Portal to obtain an out-of-the-box optimized parallel implementation. Experimental results on 6 N-body problems show that Portal is within a factor of 5% on average of expert hand-optimized C++ code on a dual-socket AMD EPYC processor. To our knowledge, there are no known libraries or frameworks that implement parallel asymptotically optimal algorithms for the class of generalized N-body problems and Portal aims to fill this gap. Moreover, the Portal language and intermediate algorithm representation are portable and easily extensible to different platforms.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128921454","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CuSP: A Customizable Streaming Edge Partitioner for Distributed Graph Analytics
Pub Date: 2019-05-01 | DOI: 10.1109/IPDPS.2019.00054
Loc Hoang, Roshan Dathathri, G. Gill, K. Pingali
Graph analytics systems must analyze graphs with billions of vertices and edges, which require several terabytes of storage. Distributed-memory clusters are often used for analyzing such large graphs since the main memory of a single machine is usually restricted to a few hundred gigabytes. This requires partitioning the graph among the machines in the cluster. Existing graph analytics systems usually come with a built-in partitioner that incorporates a particular partitioning policy, but the best partitioning policy depends on the algorithm, input graph, and platform. Therefore, built-in partitioners are not sufficiently flexible. Stand-alone graph partitioners are available, but they too implement only a small number of partitioning policies. This paper presents CuSP, a fast streaming edge partitioning framework that permits users to specify the desired partitioning policy at a high level of abstraction and quickly generates high-quality graph partitions. For example, it can partition wdc12, the largest publicly available web-crawl graph, with 4 billion vertices and 129 billion edges, in under 2 minutes on a cluster of 128 machines. Our experiments show that it can produce quality partitions 6× faster on average than the state-of-the-art stand-alone partitioner in the literature while supporting a wider range of partitioning policies.
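To make the idea of a user-specified streaming policy concrete, here is a small C++ sketch of one possible policy interface: each edge is assigned to a host exactly once as it streams by. The `EdgePolicy` type, `sourceOwnedPolicy` example, and `partitionStream` loop are illustrative assumptions, not CuSP's actual API.

```cpp
#include <cstdint>
#include <functional>

struct Edge { uint64_t src, dst; };

// A partitioning policy maps an incoming edge to one of `numHosts` partitions.
using EdgePolicy = std::function<uint32_t(const Edge&, uint32_t numHosts)>;

// Example policy: place every edge on the host that owns its source vertex,
// where ownership is determined by a simple hash of the vertex id.
uint32_t sourceOwnedPolicy(const Edge& e, uint32_t numHosts) {
    return static_cast<uint32_t>(std::hash<uint64_t>{}(e.src) % numHosts);
}

// Streaming loop: each edge is read once and assigned immediately.
template <typename EdgeStream, typename Assign>
void partitionStream(EdgeStream& stream, uint32_t numHosts,
                     const EdgePolicy& policy, Assign assign) {
    Edge e;
    while (stream.next(e)) {
        assign(e, policy(e, numHosts));  // e.g., buffer edge for the chosen host
    }
}
```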
{"title":"CuSP: A Customizable Streaming Edge Partitioner for Distributed Graph Analytics","authors":"Loc Hoang, Roshan Dathathri, G. Gill, K. Pingali","doi":"10.1109/IPDPS.2019.00054","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00054","url":null,"abstract":"Graph analytics systems must analyze graphs with billions of vertices and edges which require several terabytes of storage. Distributed-memory clusters are often used for analyzing such large graphs since the main memory of a single machine is usually restricted to a few hundreds of gigabytes. This requires partitioning the graph among the machines in the cluster. Existing graph analytics systems usually come with a built-in partitioner that incorporates a particular partitioning policy, but the best partitioning policy is dependent on the algorithm, input graph, and platform. Therefore, built-in partitioners are not sufficiently flexible. Stand-alone graph partitioners are available, but they too implement only a small number of partitioning policies. This paper presents CuSP, a fast streaming edge partitioning framework which permits users to specify the desired partitioning policy at a high level of abstraction and generates high-quality graph partitions fast. For example, it can partition wdc12, the largest publicly available web-crawl graph, with 4 billion vertices and 129 billion edges, in under 2 minutes for clusters with 128 machines. Our experiments show that it can produce quality partitions 6× faster on average than the state-of-the-art stand-alone partitioner in the literature while supporting a wider range of partitioning policies.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129116289","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Excavating the Potential of GPU for Accelerating Graph Traversal
Pub Date: 2019-05-01 | DOI: 10.1109/IPDPS.2019.00032
Pengyu Wang, Lu Zhang, Chao Li, M. Guo
Graph traversal is an essential procedure for a growing number of applications today. These algorithms typically iterate over the input graph until convergence, and the logic of each iteration is quite simple. GPUs are used extensively as graph traversal accelerators thanks to their massive parallelism and high-bandwidth memory access. However, existing methods are inefficient in two ways. First, streaming multiprocessors (SMs) remain underutilized due to unbalanced load allocation and uncoalesced memory access. Second, they use space-inefficient data structures or need auxiliary data to assist traversal, which is undesirable given the limited GPU memory capacity. Moreover, existing designs commonly focus on optimizing kernel execution time, yet data-transfer time is also notable in the whole procedure. Thus, both space-efficient data structures and the data-transfer policy must be considered. In this paper, we propose EtaGraph, a novel GPU graph traversal framework optimized for the GPU memory system and execution parallelism. EtaGraph has several features: (1) it uses a frontier-like kernel execution model featuring a lightweight graph transformation procedure, named Unified Degree Cut, that allows GPU threads to process skewed graphs efficiently without modifying the raw data or introducing extra space overhead; (2) it uses on-demand data transfer to overlap computation, optimizing the total time of data transfer and execution; (3) it adopts explicit use of shared memory to enhance memory coalescing and improve effective memory bandwidth. Evaluation of EtaGraph shows significant and consistent speedups over state-of-the-art GPU-based graph processing frameworks on both real-world and synthetic graphs.
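The "frontier-like" execution model mentioned above can be illustrated with a sequential C++ sketch over a CSR graph: each iteration expands the current frontier and produces the next one. GPU frameworks like EtaGraph map this same pattern onto thousands of threads; the code below is only a host-side illustration of the model, not EtaGraph's implementation.

```cpp
#include <cstdint>
#include <vector>

// Compressed Sparse Row graph: row_offsets has n+1 entries, col_indices has m.
struct CSRGraph {
    std::vector<uint32_t> row_offsets;
    std::vector<uint32_t> col_indices;
};

// One BFS-style frontier iteration: visit every neighbor of every frontier
// vertex and collect the newly discovered vertices as the next frontier.
std::vector<uint32_t> advanceFrontier(const CSRGraph& g,
                                      const std::vector<uint32_t>& frontier,
                                      std::vector<bool>& visited) {
    std::vector<uint32_t> next;
    for (uint32_t u : frontier) {
        for (uint32_t e = g.row_offsets[u]; e < g.row_offsets[u + 1]; ++e) {
            uint32_t v = g.col_indices[e];
            if (!visited[v]) { visited[v] = true; next.push_back(v); }
        }
    }
    return next;
}
```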
{"title":"Excavating the Potential of GPU for Accelerating Graph Traversal","authors":"Pengyu Wang, Lu Zhang, Chao Li, M. Guo","doi":"10.1109/IPDPS.2019.00032","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00032","url":null,"abstract":"Graph traversal is an essential procedure for a growing amount of applications today. This type of algorithms typically iterate input graph datasets until convergence and the logic of each iteration is quite simple. GPUs are used extensively as graph traversal accelerators due to the capability of massive parallelism and high-bandwidth memory access. However, existing methods are inefficient in two ways. First, streaming multiprocessors (SMs) are still underutilized due to the unbalanced load allocation and uncoalesced memory access. Second, they use space-inefficient data structures or need auxiliary data to assist traversal. It is undesirable, considering the limited GPU memory capacity. Moreover, existing designs commonly focus on optimizing kernel execution time. Data-transfer time is also notable in the whole procedure. Thus, space-efficient data structure and data-transfer policy should be concerned. In this paper, we propose EtaGraph, a novel GPU graph traversal framework optimized for GPU memory system and execution parallelism. EtaGraph has several features: 1). It uses a frontier-like kernel execution model, featuring a lightweight graph transformation procedure, named Unified Degree Cut, allowing GPU threads to process skewed graph efficiently without modification of raw data or introducing extra space overhead; 2). It uses on-demand data-transfer to overlap computation so that it optimizes the total time of data-transfer and execution; 3). It adopts an explicit utilization of Shared Memory to enhance memory coalescing and to improve effective memory bandwidth. Evaluation of EtaGraph shows significant and consistent speedups over the state-of-the-art GPU-based graph processing frameworks on both real-world and synthetic graphs.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131684352","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Exploiting Adaptive Data Compression to Improve Performance and Energy-Efficiency of Compute Workloads in Multi-GPU Systems
Pub Date: 2019-05-01 | DOI: 10.1109/IPDPS.2019.00075
Mohammad Khavari Tavana, Yifan Sun, Nicolas Bohm Agostini, D. Kaeli
Graphics Processing Unit (GPU) performance has relied heavily on our ability to scale the number of transistors on a chip in order to satisfy the ever-increasing demand for more computation. However, transistor scaling has become extremely challenging, limiting the number of transistors that can be crammed onto a single die. Manufacturing large, fast and energy-efficient monolithic GPUs, while growing the number of stream processing units on-chip, is no longer a viable solution to scale performance. GPU vendors instead aim to exploit multi-GPU solutions, interconnecting multiple GPUs in a single node with a high-bandwidth network (such as NVLink), or exploiting Multi-Chip-Module (MCM) packaging, where multiple GPU modules are integrated in a single package. Inter-GPU bandwidth is an expensive and critical resource for designing multi-GPU systems, and the design of the inter-GPU network can impact performance significantly. To address this challenge, in this paper we explore the potential of hardware-based memory compression algorithms to save bandwidth and improve energy efficiency in multi-GPU systems. Specifically, we propose an adaptive inter-GPU data compression scheme to efficiently improve both performance and energy efficiency. Our evaluation shows that the proposed optimization on multi-GPU architectures can reduce inter-GPU traffic by up to 62%, improve system performance by up to 33%, and save energy spent powering the communication fabric by 45% on average.
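The word "adaptive" here means compressing only when it actually pays off. As a toy illustration (not the paper's hardware design, and all names and numbers below are assumptions), a software cost model for that decision might compare the raw transfer time against compression plus the smaller transfer plus decompression:

```cpp
#include <cstddef>

// Toy cost model for deciding whether to compress an inter-GPU transfer.
struct LinkModel {
    double bytes_per_sec;             // inter-GPU link bandwidth
    double compress_bytes_per_sec;    // throughput of the compressor
    double decompress_bytes_per_sec;  // throughput of the decompressor
};

bool shouldCompress(std::size_t message_bytes, double expected_ratio,
                    const LinkModel& link) {
    double raw_time = message_bytes / link.bytes_per_sec;
    double compressed_bytes = message_bytes / expected_ratio;
    double compressed_time = message_bytes / link.compress_bytes_per_sec
                           + compressed_bytes / link.bytes_per_sec
                           + compressed_bytes / link.decompress_bytes_per_sec;
    return compressed_time < raw_time;  // compress only when it saves time
}
```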
{"title":"Exploiting Adaptive Data Compression to Improve Performance and Energy-Efficiency of Compute Workloads in Multi-GPU Systems","authors":"Mohammad Khavari Tavana, Yifan Sun, Nicolas Bohm Agostini, D. Kaeli","doi":"10.1109/IPDPS.2019.00075","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00075","url":null,"abstract":"Graphics Processing Unit (GPU) performance has relied heavily on our ability to scale of number of transistors on chip, in order to satisfy the ever-increasing demands for more computation. However, transistor scaling has become extremely challenging, limiting the number of transistors that can be crammed onto a single die. Manufacturing large, fast and energy-efficient monolithic GPUs, while growing the number of stream processing units on-chip, is no longer a viable solution to scale performance. GPU vendors are aiming to exploit multi-GPU solutions, interconnecting multiple GPUs in the single node with a high bandwidth network (such as NVLink), or exploiting Multi-Chip-Module (MCM) packaging, where multiple GPU modules are integrated in a single package. The inter-GPU bandwidth is an expensive and critical resource for designing multi-GPU systems. The design of the inter-GPU network can impact performance significantly. To address this challenge, in this paper we explore the potential of hardware-based memory compression algorithms to save bandwidth and improve energy efficiency in multi-GPU systems. Specifically, we propose an adaptive inter-GPU data compression scheme to efficiently improve both performance and energy efficiency. Our evaluation shows that the proposed optimization on multi-GPU architectures can reduce the inter-GPU traffic up to 62%, improve system performance by up to 33%, and save energy spent powering the communication fabric by 45%, on average.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127660120","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Distributed Dominating Set and Connected Dominating Set Construction Under the Dynamic SINR Model
Pub Date: 2019-05-01 | DOI: 10.1109/IPDPS.2019.00092
Dongxiao Yu, Yifei Zou, Yong Zhang, Feng Li, Jiguo Yu, Yu Wu, Xiuzhen Cheng, F. Lau
This paper investigates distributed Dominating Set (DS) and Connected Dominating Set (CDS) construction in dynamic wireless networks under the SINR interference model. Specifically, we present a new model for dynamic networks that admits both churn (due to node arrivals and departures) and node mobility. Under this dynamic model, we propose efficient algorithms to construct a DS and a CDS with constant approximation ratios w.r.t. the corresponding minimum ones in O(log n) time with a high probability guarantee. To the best of our knowledge, these are the first known algorithms for DS and CDS construction in dynamic networks under the SINR interference model. We believe our dynamic network model can greatly facilitate the study of distributed algorithms in mobile and dynamic wireless networks.
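For readers unfamiliar with the interference model, the standard SINR reception condition is shown below; the paper may use slightly different parameter conventions, so treat this as the generic formulation rather than its exact model.

```latex
% A transmission from u is received by v iff the signal-to-interference-plus-
% noise ratio at v exceeds a hardware-dependent threshold \beta.
\[
  \mathrm{SINR}(u,v) \;=\;
  \frac{P_u \cdot d(u,v)^{-\alpha}}
       {N + \sum_{w \in I \setminus \{u\}} P_w \cdot d(w,v)^{-\alpha}}
  \;\ge\; \beta
\]
% P_u: transmission power of u, d(u,v): distance between u and v,
% \alpha: path-loss exponent, N: ambient noise,
% I: set of nodes transmitting concurrently with u.
```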
{"title":"Distributed Dominating Set and Connected Dominating Set Construction Under the Dynamic SINR Model","authors":"Dongxiao Yu, Yifei Zou, Yong Zhang, Feng Li, Jiguo Yu, Yu Wu, Xiuzhen Cheng, F. Lau","doi":"10.1109/IPDPS.2019.00092","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00092","url":null,"abstract":"This paper investigates distributed Dominating Set (DS) and Connected Dominating Set (CDS) construction in dynamic wireless networks under the SINR interference model. Specifically, we present a new model for dynamic networks that admits both churns (due to node arrivals/departures) and node mobility. Under this dynamic model, we propose efficient algorithms to construct a DS and a CDS with constant approximation ratios w.r.t. the corresponding minimum ones in O(log n) time with a high probability guarantee. To the best of our knowledge, these algorithms are the first known ones for DS and CDS construction in dynamic networks assuming the SINR interference model. We believe our dynamic network model can greatly facilitate distributed algorithm studies in mobile and dynamic wireless networks.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"118 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134540606","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fast Batched Matrix Multiplication for Small Sizes Using Half-Precision Arithmetic on GPUs
Pub Date: 2019-05-01 | DOI: 10.1109/IPDPS.2019.00022
A. Abdelfattah, S. Tomov, J. Dongarra
Matrix multiplication (GEMM) is the most important operation in dense linear algebra. Because it is a compute-bound operation that is rich in data reuse, many applications from different scientific domains cast their most performance-critical stages as GEMM. With the rise of batch linear algebra, batched GEMM operations have become increasingly popular in domains other than dense linear solvers, such as tensor contractions, sparse direct solvers, and machine learning. In particular for the latter, batched GEMM in reduced precision (i.e., FP16) has been the core operation of many deep learning frameworks. This paper introduces an optimized batched GEMM for FP16 arithmetic (HGEMM) on graphics processing units (GPUs). We provide a detailed design strategy that takes advantage of the Tensor Core technology recently introduced in CUDA-enabled GPUs. The developed solution uses low-level APIs provided by the vendor in an optimized design that overcomes the limitations imposed by the hardware (in the form of discrete configurations). The outcome is a highly flexible GPU kernel that provides a lot of control to the developer despite the aforementioned restrictions. The paper also pays particular attention to multiplications of very small matrices that cannot fully occupy the Tensor Core units. Our results show that the proposed design can outperform the highly optimized vendor routine for sizes up to 100 by factors between 1.2x and 10x on a Tesla V100 GPU. For extremely small matrices, the observed speedups range between 1.8x and 26x.
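For orientation, the host-side sketch below shows how a batched FP16 GEMM is typically invoked through cuBLAS (cublasHgemmBatched), assuming that routine is representative of the "vendor routine" the abstract compares against; the matrix sizes and zero-initialized contents are arbitrary placeholders, and error checking is omitted.

```cpp
#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <vector>

int main() {
    const int m = 16, n = 16, k = 16, batch = 1000;
    cublasHandle_t handle;
    cublasCreate(&handle);

    // One contiguous allocation per operand; each batch entry is a slice of it.
    __half *A, *B, *C;
    cudaMalloc(&A, sizeof(__half) * m * k * batch);
    cudaMalloc(&B, sizeof(__half) * k * n * batch);
    cudaMalloc(&C, sizeof(__half) * m * n * batch);
    cudaMemset(A, 0, sizeof(__half) * m * k * batch);
    cudaMemset(B, 0, sizeof(__half) * k * n * batch);
    cudaMemset(C, 0, sizeof(__half) * m * n * batch);

    // cublasHgemmBatched takes device arrays of per-matrix pointers.
    std::vector<__half*> hA(batch), hB(batch), hC(batch);
    for (int i = 0; i < batch; ++i) {
        hA[i] = A + i * m * k;
        hB[i] = B + i * k * n;
        hC[i] = C + i * m * n;
    }
    __half **dA, **dB, **dC;
    cudaMalloc(&dA, sizeof(__half*) * batch);
    cudaMalloc(&dB, sizeof(__half*) * batch);
    cudaMalloc(&dC, sizeof(__half*) * batch);
    cudaMemcpy(dA, hA.data(), sizeof(__half*) * batch, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), sizeof(__half*) * batch, cudaMemcpyHostToDevice);
    cudaMemcpy(dC, hC.data(), sizeof(__half*) * batch, cudaMemcpyHostToDevice);

    // C[i] = alpha * A[i] * B[i] + beta * C[i] for every matrix in the batch.
    __half alpha = __float2half(1.0f), beta = __float2half(0.0f);
    cublasHgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                       &alpha, dA, m, dB, k, &beta, dC, m, batch);

    cudaDeviceSynchronize();
    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```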
{"title":"Fast Batched Matrix Multiplication for Small Sizes Using Half-Precision Arithmetic on GPUs","authors":"A. Abdelfattah, S. Tomov, J. Dongarra","doi":"10.1109/IPDPS.2019.00022","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00022","url":null,"abstract":"Matrix multiplication (GEMM) is the most important operation in dense linear algebra. Because it is a compute-bound operation that is rich in data reuse, many applications from different scientific domains cast their most performance-critical stages to use GEMM. With the rise of batch linear algebra, batched GEMM operations have become increasingly popular in domains other than dense linear solvers, such as tensor contractions, sparse direct solvers, and machine learning. In particular for the latter, batched GEMM in reduced precision (i.e., FP16) has been the core operation of many deep learning frameworks. This paper introduces an optimized batched GEMM for FP16 arithmetic (HGEMM) using graphics processing units (GPUs). We provide a detailed design strategy that takes advantage of the Tensor Core technology that was recently introduced in CUDA-enabled GPUs. The developed solution uses low-level APIs provided by the vendor in an optimized design that overcomes the limitations imposed by the hardware (in the form of discrete configurations). The outcome is a highly flexible GPU kernel that provides a lot of controls to the developer, despite the aforementioned restrictions. The paper also pays particular attention to multiplications of very small matrices that cannot fully occupy the Tensor Core units. Our results show that the proposed design can outperform the highly optimized vendor routine for sizes up to 100 by factors between 1.2x and 10x using a Tesla V100 GPU. For extremely small matrices, the observed speedups range between 1.8x and 26x.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124720763","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Semantics-Aware Virtual Machine Image Management in IaaS Clouds
Pub Date: 2019-05-01 | DOI: 10.1109/IPDPS.2019.00052
Nishant Saurabh, Julian Remmers, Dragi Kimovski, R. Prodan, Jorge G. Barbosa
Infrastructure-as-a-Service (IaaS) clouds concurrently accommodate diverse sets of user requests, requiring an efficient strategy for storing and retrieving virtual machine images (VMIs) at a large scale. VMI storage management requires dealing with multiple VMIs, typically gigabytes in size, which entails VMI sprawl issues that hinder elastic resource management and provisioning. Nevertheless, existing techniques for VMI management overlook VMI semantics (i.e., at the level of the base image and software packages), offering either a restricted ability to identify and extract reusable functionalities or high VMI publish and retrieval overheads. In this paper, we design, implement and evaluate Expelliarmus, a novel VMI management system that helps to minimize storage, publish and retrieval overheads. To achieve this goal, Expelliarmus incorporates three complementary features. First, it models VMIs as semantic graphs to expedite the similarity computation between multiple VMIs. Second, Expelliarmus provides semantics-aware VMI decomposition and base image selection to extract and store non-redundant base images and software packages. Third, Expelliarmus can assemble VMIs based on the required software packages upon user request. We evaluate Expelliarmus through a representative set of synthetic cloud VMIs on a real test-bed. Experimental results show that our semantics-centric approach is able to reduce repository size by 2.2-16 times compared to state-of-the-art systems (e.g., IBM's Mirage and Hemera) with significant improvement in VMI publish and retrieval performance.
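One simple way to think about package-level similarity between two VMIs is the Jaccard similarity of their installed-package sets; the paper's semantic-graph comparison is richer than this, so the C++ sketch below is only an illustrative stand-in, not Expelliarmus's algorithm.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <set>
#include <string>

// Jaccard similarity of two package sets: |A ∩ B| / |A ∪ B|.
double packageSimilarity(const std::set<std::string>& a,
                         const std::set<std::string>& b) {
    std::set<std::string> common;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::inserter(common, common.begin()));
    std::size_t unionSize = a.size() + b.size() - common.size();
    return unionSize == 0 ? 1.0
                          : static_cast<double>(common.size()) / unionSize;
}
```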
{"title":"Semantics-Aware Virtual Machine Image Management in IaaS Clouds","authors":"Nishant Saurabh, Julian Remmers, Dragi Kimovski, R. Prodan, Jorge G. Barbosa","doi":"10.1109/IPDPS.2019.00052","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00052","url":null,"abstract":"Infrastructure-as-a-service (IaaS) Clouds concurrently accommodate diverse sets of user requests, requiring an efficient strategy for storing and retrieving virtual machine images (VMIs) at a large scale. The VMI storage management require dealing with multiple VMIs, typically in the magnitude of gigabytes, which entails VMI sprawl issues hindering the elastic resource management and provisioning. Nevertheless, existing techniques to facilitate VMI management overlook VMI semantics (i.e at the level of base image and software packages) with either restricted possibility to identify and extract reusable functionalities or with higher VMI publish and retrieval overheads. In this paper, we design, implement and evaluate Expelliarmus, a novel VMI management system that helps to minimize storage, publish and retrieval overheads. To achieve this goal, Expelliarmus incorporates three complementary features. First, it makes use of VMIs modelled as semantic graphs to expedite the similarity computation between multiple VMIs. Second, Expelliarmus provides a semantic aware VMI decomposition and base image selection to extract and store non-redundant base image and software packages. Third, Expelliarmus can also assemble VMIs based on the required software packages upon user request. We evaluate Expelliarmus through a representative set of synthetic Cloud VMIs on the real test-bed. Experimental results show that our semantic-centric approach is able to optimize repository size by 2.2-16 times compared to state-of-the-art systems (e.g. IBM's Mirage and Hemera) with significant VMI publish and retrieval performance improvement.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125131537","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
iez: Resource Contention Aware Load Balancing for Large-Scale Parallel File Systems
Pub Date: 2019-05-01 | DOI: 10.1109/IPDPS.2019.00070
Bharti Wadhwa, A. Paul, Sarah Neuwirth, Feiyi Wang, S. Oral, A. Butt, Jon Bernard, K. Cameron
Parallel I/O performance is crucial to sustaining scientific applications on large-scale High-Performance Computing (HPC) systems. However, I/O load imbalance in the underlying distributed and shared storage systems can significantly reduce overall application performance. There are two conflicting challenges to mitigate this load imbalance: (i) optimizing system-wide data placement to maximize the bandwidth advantages of distributed storage servers, i.e., allocating I/O resources efficiently across applications and job runs; and (ii) optimizing client-centric data movement to minimize I/O request latency between clients and servers, i.e., allocating I/O resources efficiently in service of a single application and job run. Moreover, existing approaches that require application changes limit widespread adoption in commercial or proprietary deployments. We propose iez, an "end-to-end control plane" where clients transparently and adaptively write to a set of selected I/O servers to achieve balanced data placement. Our control plane leverages real-time load information for global data placement across distributed storage servers, while our design model leverages trace-based optimization techniques to minimize I/O request latency between clients and servers. We evaluate our proposed system on an experimental cluster for two common use cases: the synthetic I/O benchmark IOR for large sequential writes and a scientific application I/O kernel, HACC-I/O. Results show read and write performance improvements of up to 34% and 32%, respectively, compared to the state of the art.
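The core contention-aware idea (route the next write to the servers that are currently least loaded) can be sketched in a few lines of C++; iez's actual placement combines global, trace-driven decisions with client-side adaptation, so the struct and function below are illustrative assumptions only.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct ServerLoad { int id; double pendingBytes; };  // real-time load report

// Pick the k least-loaded I/O servers for a client's next write.
std::vector<int> pickServers(std::vector<ServerLoad> servers, std::size_t k) {
    std::size_t count = std::min(k, servers.size());
    std::partial_sort(servers.begin(), servers.begin() + count, servers.end(),
                      [](const ServerLoad& a, const ServerLoad& b) {
                          return a.pendingBytes < b.pendingBytes;
                      });
    std::vector<int> chosen;
    for (std::size_t i = 0; i < count; ++i)
        chosen.push_back(servers[i].id);
    return chosen;
}
```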
{"title":"iez: Resource Contention Aware Load Balancing for Large-Scale Parallel File Systems","authors":"Bharti Wadhwa, A. Paul, Sarah Neuwirth, Feiyi Wang, S. Oral, A. Butt, Jon Bernard, K. Cameron","doi":"10.1109/IPDPS.2019.00070","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00070","url":null,"abstract":"Parallel I/O performance is crucial to sustaining scientific applications on large-scale High-Performance Computing (HPC) systems. However, I/O load imbalance in the underlying distributed and shared storage systems can significantly reduce overall application performance. There are two conflicting challenges to mitigate this load imbalance: (i) optimizing systemwide data placement to maximize the bandwidth advantages of distributed storage servers, i.e., allocating I/O resources efficiently across applications and job runs; and (ii) optimizing client-centric data movement to minimize I/O load request latency between clients and servers, i.e., allocating I/O resources efficiently in service to a single application and job run. Moreover, existing approaches that require application changes limit wide-spread adoption in commercial or proprietary deployments. We propose iez, an \"end-to-end control plane\" where clients transparently and adaptively write to a set of selected I/O servers to achieve balanced data placement. Our control plane leverages realtime load information for distributed storage server global data placement while our design model leverages trace-based optimization techniques to minimize I/O load request latency between clients and servers. We evaluate our proposed system on an experimental cluster for two common use cases: synthetic I/O benchmark IOR for large sequential writes and a scientific application I/O kernel, HACC-I/O. Results show read and write performance improvements of up to 34% and 32%, respectively, compared to the state of the art.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127894168","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sizing and Partitioning Strategies for Burst-Buffers to Reduce IO Contention
Pub Date: 2019-05-01 | DOI: 10.1109/IPDPS.2019.00072
G. Aupy, Olivier Beaumont, Lionel Eyraud-Dubois
Burst-Buffers are high-throughput, small-capacity storage used as intermediate storage between the PFS (Parallel File System) and the computational nodes of modern HPC systems. They can help mitigate contention on the PFS, a shared resource whose read and write performance increases more slowly than processing power in HPC systems. A second usage is to accelerate data transfers and hide the latency to the PFS. In this paper, we concentrate on the first usage. We propose a model for Burst-Buffers and application transfers, and we consider the problem of dimensioning and sharing the Burst-Buffers between several applications. This dimensioning can be done either dynamically or statically. The dynamic allocation considers that any application can use any available portion of the Burst-Buffers. The static allocation considers that when a new application enters the system, it is assigned some portion of the Burst-Buffers, which cannot be used by the other applications until that application leaves the system and its data is purged. We show that the general sharing problem of guaranteeing fair performance for all applications is NP-complete. We propose polynomial-time algorithms for the special case of finding the optimal buffer size such that no application is slowed down due to PFS contention, both in the static and dynamic cases. Finally, we evaluate our algorithms in realistic settings and use these evaluations to discuss how to minimize the overhead of static allocation of buffers compared to dynamic allocation.
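The intuition behind the static/dynamic trade-off can be shown with a toy calculation (this is not the paper's model, and `demand[a][t]` is a made-up input meaning application a's buffered volume at time step t): static partitioning must reserve each application's own peak, while dynamic sharing only needs the peak of the aggregate demand, which is never larger.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Buffer capacity needed if each application gets its own static partition:
// the sum of per-application peaks.
double staticSize(const std::vector<std::vector<double>>& demand) {
    double total = 0.0;
    for (const auto& app : demand)
        total += *std::max_element(app.begin(), app.end());
    return total;
}

// Buffer capacity needed if the whole buffer is shared dynamically:
// the peak of the aggregate demand over time.
double dynamicSize(const std::vector<std::vector<double>>& demand) {
    std::size_t steps = demand.empty() ? 0 : demand[0].size();
    double peak = 0.0;
    for (std::size_t t = 0; t < steps; ++t) {
        double sum = 0.0;
        for (const auto& app : demand) sum += app[t];
        peak = std::max(peak, sum);
    }
    return peak;
}
```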
{"title":"Sizing and Partitioning Strategies for Burst-Buffers to Reduce IO Contention","authors":"G. Aupy, Olivier Beaumont, Lionel Eyraud-Dubois","doi":"10.1109/IPDPS.2019.00072","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00072","url":null,"abstract":"Burst-Buffers are high throughput and small size storage which are being used as an intermediate storage between the PFS (Parallel File System) and the computational nodes of modern HPC systems. They can allow to hinder to contention to the PFS, a shared resource whose read and write performance increase slower than processing power in HPC systems. A second usage is to accelerate data transfers and to hide the latency to the PFS. In this paper, we concentrate on the first usage. We propose a model for Burst-Buffers and application transfers. We consider the problem of dimensioning and sharing the Burst-Buffers between several applications. This dimensioning can be done either dynamically or statically. The dynamic allocation considers that any application can use any available portion of the Burst-Buffers. The static allocation considers that when a new application enters the system, it is assigned some portion of the Burst-Buffers, which cannot be used by the other applications until that application leaves the system and its data is purged from it. We show that the general sharing problem to guarantee fair performance for all applications is an NP-Complete problem. We propose a polynomial time algorithms for the special case of finding the optimal buffer size such that no application is slowed down due to PFS contention, both in the static and dynamic cases. Finally, we provide evaluations of our algorithms in realistic settings. We use those to discuss how to minimize the overhead of the static allocation of buffers compared to the dynamic allocation.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128985039","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cpp-Taskflow: Fast Task-Based Parallel Programming Using Modern C++
Pub Date: 2019-05-01 | DOI: 10.1109/IPDPS.2019.00105
Tsung-Wei Huang, Chun-Xun Lin, Guannan Guo, Martin D. F. Wong
In this paper we introduce Cpp-Taskflow, a new C++ tasking library that helps developers quickly write parallel programs using task dependency graphs. Cpp-Taskflow leverages the power of modern C++ and task-based approaches to enable efficient implementations of parallel decomposition strategies. Our programming model can handle not only traditional loop-level parallelism but also irregular patterns such as graph algorithms, incremental flows, and dynamic data structures. Compared with existing libraries, Cpp-Taskflow is more cost-efficient in performance scaling and software integration. We have evaluated Cpp-Taskflow on both micro-benchmarks and real-world applications with million-scale tasking. In a machine learning example, Cpp-Taskflow achieved 1.5–2.7× less coding complexity and a 14–38% speed-up over two industrial-strength libraries, OpenMP Tasking and Intel Threading Building Blocks (TBB).
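The canonical usage pattern is to declare tasks, wire up their dependencies, and hand the resulting graph to an executor. The example below follows the API of current Taskflow releases, which may differ slightly in naming from the 2019 version described in the paper.

```cpp
#include <taskflow/taskflow.hpp>

int main() {
    tf::Executor executor;   // thread pool that runs task graphs
    tf::Taskflow taskflow;   // the task dependency graph being built

    // Four tasks; the bodies are placeholders for real work.
    auto [A, B, C, D] = taskflow.emplace(
        [] { /* load data     */ },
        [] { /* preprocess    */ },
        [] { /* build model   */ },
        [] { /* write results */ });

    A.precede(B, C);   // A runs before B and C (B and C may run in parallel)
    D.succeed(B, C);   // D runs only after both B and C complete

    executor.run(taskflow).wait();   // submit the graph and wait for completion
    return 0;
}
```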
{"title":"Cpp-Taskflow: Fast Task-Based Parallel Programming Using Modern C++","authors":"Tsung-Wei Huang, Chun-Xun Lin, Guannan Guo, Martin D. F. Wong","doi":"10.1109/IPDPS.2019.00105","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00105","url":null,"abstract":"In this paper we introduce Cpp-Taskflow, a new C++ tasking library to help developers quickly write parallel programs using task dependency graphs. Cpp-Taskflow leverages the power of modern C++ and task-based approaches to enable efficient implementations of parallel decomposition strategies. Our programming model can quickly handle not only traditional loop-level parallelism, but also irregular patterns such as graph algorithms, incremental flows, and dynamic data structures. Compared with existing libraries, Cpp-Taskflow is more cost efficient in performance scaling and software integration. We have evaluated Cpp-Taskflow on both micro-benchmarks and real-world applications with million-scale tasking. In a machine learning example, Cpp-Taskflow achieved 1.5–2.7× less coding complexity and 14–38% speed-up over two industrial-strength libraries OpenMP Tasking and Intel Threading Building Blocks (TBB).","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"212 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124158852","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}