BitGNN: Unleashing the Performance Potential of Binary Graph Neural Networks on GPUs
Jou-An Chen, Hsin-Hsuan Sung, Xipeng Shen, Sutanay Choudhury, Ang Li
DOI: 10.1145/3577193.3593725
Recent studies have shown that binary Graph Neural Networks (GNNs) are promising for reducing the computation of GNNs through binarized tensors. Prior work, however, mainly focused on algorithm design or training techniques, leaving open how to fully realize the performance potential on accelerator hardware. This work redesigns the binary GNN inference backend from the efficiency perspective. It fills the gap with a series of abstractions and techniques that map binary GNNs and their computations to best fit the nature of bit manipulations on GPUs. Results on real-world graphs with GCNs, GraphSAGE, and GraphSAINT show that the proposed techniques outperform state-of-the-art binary GNN implementations by 8-22× while maintaining the same accuracy. The BitGNN code is publicly available.
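To make the bit-level idea concrete, here is a minimal NumPy sketch (our illustration, not BitGNN's code) of the XNOR-popcount primitive that binary-tensor backends map onto GPU bit operations: ±1 values are packed into machine words, and a dot product becomes a bitwise XNOR followed by a population count.

```python
import numpy as np

def pack_pm1(x):
    """Pack a +/-1 vector into uint8 words (bit 1 encodes +1)."""
    return np.packbits(x > 0)

def binary_dot(a_bits, b_bits, n):
    """dot(a, b) for +/-1 vectors of length n, from their packed bits."""
    xnor = ~(a_bits ^ b_bits)                # bit is 1 where the inputs agree
    # Count agreeing bits; count=n drops the padding added by packbits.
    matches = int(np.unpackbits(xnor, count=n).sum())
    return 2 * matches - n                   # agreements minus disagreements

rng = np.random.default_rng(0)
n = 37
a = rng.choice([-1, 1], size=n)
b = rng.choice([-1, 1], size=n)
assert binary_dot(pack_pm1(a), pack_pm1(b), n) == int(a @ b)
```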
{"title":"BitGNN: Unleashing the Performance Potential of Binary Graph Neural Networks on GPUs","authors":"Jou-An Chen, Hsin-Hsuan Sung, Xipeng Shen, Sutanay Choudhury, Ang Li","doi":"10.1145/3577193.3593725","DOIUrl":"https://doi.org/10.1145/3577193.3593725","url":null,"abstract":"Recent studies have shown that Binary Graph Neural Networks (GNNs) are promising for saving computations of GNNs through binarized tensors. Prior work, however, mainly focused on algorithm designs or training techniques, leaving it open to how to materialize the performance potential on accelerator hardware fully. This work redesigns the binary GNN inference backend from the efficiency perspective. It fills the gap by proposing a series of abstractions and techniques to map binary GNNs and their computations best to fit the nature of bit manipulations on GPUs. Results on real-world graphs with GCNs, GraphSAGE, and GraphSAINT show that the proposed techniques outperform state-of-the-art binary GNN implementations by 8-22X with the same accuracy maintained. BitGNN code is publicly available.1.","PeriodicalId":424155,"journal":{"name":"Proceedings of the 37th International Conference on Supercomputing","volume":"266 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122911368","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Anatomy of High-Performance GEMM with Online Fault Tolerance on GPUs
Shixun Wu, Yujia Zhai, Jinyang Liu, Jiajun Huang, Zizhe Jian, Bryan M. Wong, Zizhong Chen
DOI: 10.1145/3577193.3593715
General Matrix Multiplication (GEMM) is a crucial algorithm for applications such as machine learning and scientific computing, since an efficient GEMM implementation is essential to the performance of these calculations. While researchers often strive for faster performance on large computing platforms, the increased scale of these systems raises concerns about hardware and software reliability. In this paper, we present the design of a high-performance GPU-based GEMM that integrates an algorithm-based fault tolerance scheme to detect and correct silent data corruptions at the computing units on the fly. We explore fault-tolerant designs for GEMM at the thread, warp, and threadblock levels, and also provide a baseline GEMM implementation that is competitive with or faster than the state-of-the-art, closed-source cuBLAS GEMM. We present a kernel fusion strategy that overlaps the memory latency due to fault tolerance with the original GEMM computation. To support a wide range of input matrix shapes and reduce development costs, we present a template-based approach for automatically generating both fault-tolerant and non-fault-tolerant GEMM implementations. We evaluate our work on NVIDIA Tesla T4 and A100 server GPUs. Our experimental results demonstrate that our baseline GEMM achieves comparable or superior performance to the closed-source cuBLAS. Compared with the prior state-of-the-art non-fused fault-tolerant GEMM, our optimal fused strategy achieves a 39.04% speedup on average. In addition, our fault-tolerant GEMM incurs only minimal overhead (8.89% on average) relative to cuBLAS, even with hundreds of errors injected per minute. For irregularly shaped inputs, the kernels produced by the code generator achieve remarkable speedups of 160%-183.5% and 148.55%-165.12% for fault-tolerant and non-fault-tolerant GEMMs, respectively, outperforming cuBLAS by up to 41.40%.
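The "detect and correct on-the-fly" mechanism builds on classic algorithm-based fault tolerance (ABFT), in which checksum rows and columns are carried through the multiplication. Below is a minimal NumPy sketch of that underlying idea, not the paper's fused GPU kernels; the function names are ours.

```python
import numpy as np

def abft_gemm(A, B):
    """Multiply with an extra column-checksum row and row-checksum column."""
    Ac = np.vstack([A, A.sum(axis=0)])                  # (m+1) x k
    Br = np.hstack([B, B.sum(axis=1, keepdims=True)])   # k x (n+1)
    return C := Ac @ Br if False else Ac @ Br           # (m+1) x (n+1)

def check_and_correct(C, tol=1e-8):
    """Locate and repair a single corrupted element of the (m x n) body."""
    body = C[:-1, :-1]
    row_err = np.abs(body.sum(axis=1) - C[:-1, -1])     # row-checksum residuals
    col_err = np.abs(body.sum(axis=0) - C[-1, :-1])     # column-checksum residuals
    bad_rows = np.where(row_err > tol)[0]
    bad_cols = np.where(col_err > tol)[0]
    if len(bad_rows) == 1 and len(bad_cols) == 1:       # one inconsistent cross
        i, j = bad_rows[0], bad_cols[0]
        # True value = row checksum minus the rest of the row.
        body[i, j] = C[i, -1] - (body[i, :].sum() - body[i, j])
    return body

A, B = np.random.rand(4, 5), np.random.rand(5, 3)
C = abft_gemm(A, B)
C[1, 2] += 7.0                                          # inject a silent corruption
assert np.allclose(check_and_correct(C), A @ B)
```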
{"title":"Anatomy of High-Performance GEMM with Online Fault Tolerance on GPUs","authors":"Shixun Wu, Yujia Zhai, Jinyang Liu, Jiajun Huang, Zizhe Jian, Bryan M. Wong, Zizhong Chen","doi":"10.1145/3577193.3593715","DOIUrl":"https://doi.org/10.1145/3577193.3593715","url":null,"abstract":"General Matrix Multiplication (GEMM) is a crucial algorithm for various applications such as machine learning and scientific computing since an efficient GEMM implementation is essential for the performance of these calculations. While researchers often strive for faster performance by using large computing platforms, the increased scale of these systems can raise concerns about hardware and software reliability. In this paper, we present a design of a high-performance GPU-based GEMM that integrates an algorithm-based fault tolerance scheme that detects and corrects silent data corruptions at computing units on-the-fly. We explore fault-tolerant designs for GEMM at the thread, warp, and threadblock levels, and also provide a baseline GEMM implementation that is competitive with or faster than the state-of-the-art, closed-source cuBLAS GEMM. We present a kernel fusion strategy to overlap and mitigate the memory latency due to fault tolerance with the original GEMM computation. To support a wide range of input matrix shapes and reduce development costs, we present a template-based approach for automatic code generation for both fault-tolerant and non-fault-tolerant GEMM implementations. We evaluate our work on NVIDIA Tesla T4 and A100 server GPUs. Our experimental results demonstrate that our baseline GEMM shows comparable or superior performance compared to the closed-source cuBLAS. Compared with the prior state-of-the-art non-fused fault-tolerant GEMM, our optimal fused strategy achieves a 39.04% speedup on average. In addition, our fault-tolerant GEMM incurs only a minimal overhead (8.89% on average) compared to cuBLAS even with hundreds of errors injected per minute. For irregularly shaped inputs, the code generator-generated kernels show remarkable speedups of 160% ~ 183.5% and 148.55% ~ 165.12% for fault-tolerant and non-fault-tolerant GEMMs, respectively, which outperforms cuBLAS by up to 41.40%.","PeriodicalId":424155,"journal":{"name":"Proceedings of the 37th International Conference on Supercomputing","volume":"118 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116360146","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GPULZ: Optimizing LZSS Lossless Compression for Multi-byte Data on Modern GPUs
Bo Zhang, Jiannan Tian, S. Di, Xiaodong Yu, M. Swany, Dingwen Tao, F. Cappello
DOI: 10.1145/3577193.3593706
Today's graphics processing unit (GPU) applications produce vast volumes of data, which are challenging to store and transfer efficiently. Thus, data compression is becoming a critical technique for mitigating the storage burden and communication cost. LZSS is the core algorithm in many widely used compressors, such as Deflate. However, existing GPU-based LZSS compressors suffer from low throughput due to the sequential nature of the LZSS algorithm. Moreover, many GPU applications produce multi-byte data (e.g., int16/int32 indices, floating-point numbers), while current LZSS compression takes only single-byte data as input. To this end, we propose gpuLZ, a highly efficient LZSS compressor for multi-byte data on modern GPUs. The contribution of our work is fourfold. First, we perform an in-depth analysis of existing LZ compressors for GPUs and investigate their main issues. Second, we propose two main algorithm-level optimizations: we (1) change the prefix sum from one pass to two passes and fuse multiple kernels to reduce data movement between shared memory and global memory, and (2) optimize the existing pattern-matching approach for multi-byte symbols to reduce computational complexity and discover longer repeated patterns. Third, we perform architectural performance optimizations, such as maximizing shared memory utilization by adapting data partitions to different GPU architectures. Finally, we evaluate gpuLZ on six datasets of various types with NVIDIA A100 and A4000 GPUs. Results show that gpuLZ achieves up to 272.1× speedup on A4000 and up to 1.4× higher compression ratio compared to state-of-the-art solutions.
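For readers unfamiliar with LZSS, the toy sequential encoder/decoder below (our sketch, operating on generic symbols rather than single bytes) illustrates the symbol-granularity matching that gpuLZ parallelizes; the actual GPU algorithm with its two-pass prefix sums is far more involved.

```python
def lzss_encode(data, window=64, min_len=2):
    """Greedy LZSS over a list of symbols (each symbol may be multi-byte)."""
    out, i = [], 0
    while i < len(data):
        best_off, best_len = 0, 0
        for s in range(max(0, i - window), i):       # scan the sliding window
            l = 0
            while i + l < len(data) and data[s + l] == data[i + l]:
                l += 1                               # extend the match (may overlap i)
            if l > best_len:
                best_off, best_len = i - s, l
        if best_len >= min_len:
            out.append(("match", best_off, best_len))  # back-reference token
            i += best_len
        else:
            out.append(("lit", data[i]))               # literal symbol
            i += 1
    return out

def lzss_decode(tokens):
    out = []
    for t in tokens:
        if t[0] == "lit":
            out.append(t[1])
        else:
            _, off, length = t
            for _ in range(length):                  # symbol-by-symbol copy
                out.append(out[-off])                # handles overlapping matches
    return out

data = [3, 7, 7, 3, 7, 7, 3, 7, 7, 9]                # e.g., an int32 symbol stream
assert lzss_decode(lzss_encode(data)) == data
```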
{"title":"GPULZ: Optimizing LZSS Lossless Compression for Multi-byte Data on Modern GPUs","authors":"Bo Zhang, Jiannan Tian, S. Di, Xiaodong Yu, M. Swany, Dingwen Tao, F. Cappello","doi":"10.1145/3577193.3593706","DOIUrl":"https://doi.org/10.1145/3577193.3593706","url":null,"abstract":"Today's graphics processing unit (GPU) applications produce vast volumes of data, which are challenging to store and transfer efficiently. Thus, data compression is becoming a critical technique to mitigate the storage burden and communication cost. LZSS is the core algorithm in many widely used compressors, such as Deflate. However, existing GPU-based LZSS compressors suffer from low throughput due to the sequential nature of the LZSS algorithm. Moreover, many GPU applications produce multi-byte data (e.g., int16/int32 index, floating-point numbers), while the current LZSS compression only takes single-byte data as input. To this end, in this work, we propose gpuLZ, a highly efficient LZSS compression on modern GPUs for multi-byte data. The contribution of our work is fourfold: First, we perform an in-depth analysis of existing LZ compressors for GPUs and investigate their main issues. Then, we propose two main algorithm-level optimizations. Specifically, we (1) change prefix sum from one pass to two passes and fuse multiple kernels to reduce data movement between shared memory and global memory, and (2) optimize existing pattern-matching approach for multi-byte symbols to reduce computation complexity and explore longer repeated patterns. Third, we perform architectural performance optimizations, such as maximizing shared memory utilization by adapting data partitions to different GPU architectures. Finally, we evaluate gpuLZ on six datasets of various types with NVIDIA A100 and A4000 GPUs. Results show that gpuLZ achieves up to 272.1× speedup on A4000 and up to 1.4× higher compression ratio compared to state-of-the-art solutions.","PeriodicalId":424155,"journal":{"name":"Proceedings of the 37th International Conference on Supercomputing","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124779881","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
HEAT: A Highly Efficient and Affordable Training System for Collaborative Filtering Based Recommendation on CPUs
Chengming Zhang, Shaden Smith, Baixi Sun, Jiannan Tian, Jon Soifer, Xiaodong Yu, S. Song, Yuxiong He, Dingwen Tao
DOI: 10.1145/3577193.3593717
Collaborative filtering (CF) has been proven to be one of the most effective techniques for recommendation. Among all CF approaches, SimpleX is the state-of-the-art method, adopting a novel loss function and a proper number of negative samples. However, no prior work optimizes SimpleX on multi-core CPUs, leading to limited performance. To this end, we perform an in-depth profiling and analysis of existing SimpleX implementations and identify their performance bottlenecks: (1) irregular memory accesses, (2) unnecessary memory copies, and (3) redundant computations. To address these issues, we propose an efficient CF training system (called HEAT) that fully exploits the multi-level caching and multi-threading capabilities of modern CPUs. Specifically, the optimization of HEAT is threefold: (1) it tiles the embedding matrix to increase data locality and reduce cache misses (and thus read latency); (2) it optimizes stochastic gradient descent (SGD) with sampling by parallelizing vector products instead of matrix-matrix multiplications, in particular the similarity computation therein, to avoid memory copies for matrix data preparation; and (3) it aggressively reuses intermediate results from the forward phase in the backward phase to alleviate redundant computation. Evaluation on five widely used datasets with both x86- and ARM-architecture processors shows that HEAT achieves up to 45.2× speedup over the existing CPU solution, and 4.5× speedup and 7.9× cost reduction in the cloud over the existing GPU solution on an NVIDIA V100 GPU.
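As a rough illustration of optimization (2), the NumPy sketch below computes one user's similarities against a positive item and a handful of sampled negatives as individual vector dot products rather than a materialized matrix-matrix product. The margin-free cosine contrastive loss is our simplifying assumption, not HEAT's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_items, n_neg = 16, 1000, 8
U = rng.normal(size=d)                        # one user's embedding
V = rng.normal(size=(n_items, d))             # item embedding table

def cos(u, v):
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

pos = 42                                      # observed (positive) item
negs = rng.choice(n_items, size=n_neg, replace=False)   # sampled negatives

# Similarities via per-item vector products: one row of V at a time,
# which keeps accesses to a tiled embedding table cache-friendly.
pos_sim = cos(U, V[pos])
neg_sims = [cos(U, V[j]) for j in negs]
loss = (1.0 - pos_sim) + np.mean([max(0.0, s) for s in neg_sims])
print(f"pos_sim={pos_sim:.3f}  loss={loss:.3f}")
```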
{"title":"HEAT: A Highly Efficient and Affordable Training System for Collaborative Filtering Based Recommendation on CPUs","authors":"Chengming Zhang, Shaden Smith, Baixi Sun, Jiannan Tian, Jon Soifer, Xiaodong Yu, S. Song, Yuxiong He, Dingwen Tao","doi":"10.1145/3577193.3593717","DOIUrl":"https://doi.org/10.1145/3577193.3593717","url":null,"abstract":"Collaborative filtering (CF) has been proven to be one of the most effective techniques for recommendation. Among all CF approaches, SimpleX is the state-of-the-art method that adopts a novel loss function and a proper number of negative samples. However, there is no work that optimizes SimpleX on multi-core CPUs, leading to limited performance. To this end, we perform an in-depth profiling and analysis of existing SimpleX implementations and identify their performance bottlenecks including (1) irregular memory accesses, (2) unnecessary memory copies, and (3) redundant computations. To address these issues, we propose an efficient CF training system (called HEAT) that fully enables the multi-level caching and multi-threading capabilities of modern CPUs. Specifically, the optimization of HEAT is threefold: (1) It tiles the embedding matrix to increase data locality and reduce cache misses (thus reduces read latency); (2) It optimizes stochastic gradient descent (SGD) with sampling by parallelizing vector products instead of matrix-matrix multiplications, in particular the similarity computation therein, to avoid memory copies for matrix data preparation; and (3) It aggressively reuses intermediate results from the forward phase in the backward phase to alleviate redundant computation. Evaluation on five widely used datasets with both x86- and ARM-architecture processors shows that HEAT achieves up to 45.2× speedup over existing CPU solution and 4.5× speedup and 7.9× cost reduction in Cloud over existing GPU solution with NVIDIA V100 GPU.","PeriodicalId":424155,"journal":{"name":"Proceedings of the 37th International Conference on Supercomputing","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116969040","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training
Siddharth Singh, Olatunji Ruwase, A. Awan, Samyam Rajbhandari, Yuxiong He, A. Bhatele
DOI: 10.1145/3577193.3593704
Mixture-of-Experts (MoE) is a neural network architecture that adds sparsely activated expert blocks to a base model, increasing the number of parameters without impacting computational costs. However, current distributed deep learning frameworks are limited in their ability to train high-quality MoE models with large base models. In this work, we present DeepSpeed-TED, a novel, three-dimensional, hybrid parallel algorithm that combines data, tensor, and expert parallelism to enable the training of MoE models with 4--8× larger base models than the current state-of-the-art. We also describe memory optimizations in the optimizer step, and communication optimizations that eliminate unnecessary data movement. We implement our approach in DeepSpeed and achieve speedups of 26% over a baseline (i.e. without our communication optimizations) when training a 40 billion parameter MoE model (6.7 billion base model with 16 experts) on 128 V100 GPUs.
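To see how a three-dimensional decomposition assigns work, the sketch below maps a flat GPU rank to (data, tensor, expert) coordinates. The dimension ordering is our assumption for illustration and need not match DeepSpeed-TED's actual process-group layout.

```python
def rank_to_coords(rank, tensor_size, expert_size):
    """Decompose a flat rank into (data, expert, tensor) group coordinates."""
    tensor_id = rank % tensor_size                     # fastest-varying dimension
    expert_id = (rank // tensor_size) % expert_size
    data_id = rank // (tensor_size * expert_size)      # slowest-varying dimension
    return data_id, expert_id, tensor_id

world, tensor_size, expert_size = 16, 4, 2             # 16 GPUs = 2 data x 2 expert x 4 tensor
for r in range(world):
    d, e, t = rank_to_coords(r, tensor_size, expert_size)
    print(f"rank {r:2d} -> data {d}, expert {e}, tensor {t}")
```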
{"title":"A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training","authors":"Siddharth Singh, Olatunji Ruwase, A. Awan, Samyam Rajbhandari, Yuxiong He, A. Bhatele","doi":"10.1145/3577193.3593704","DOIUrl":"https://doi.org/10.1145/3577193.3593704","url":null,"abstract":"Mixture-of-Experts (MoE) is a neural network architecture that adds sparsely activated expert blocks to a base model, increasing the number of parameters without impacting computational costs. However, current distributed deep learning frameworks are limited in their ability to train high-quality MoE models with large base models. In this work, we present DeepSpeed-TED, a novel, three-dimensional, hybrid parallel algorithm that combines data, tensor, and expert parallelism to enable the training of MoE models with 4--8× larger base models than the current state-of-the-art. We also describe memory optimizations in the optimizer step, and communication optimizations that eliminate unnecessary data movement. We implement our approach in DeepSpeed and achieve speedups of 26% over a baseline (i.e. without our communication optimizations) when training a 40 billion parameter MoE model (6.7 billion base model with 16 experts) on 128 V100 GPUs.","PeriodicalId":424155,"journal":{"name":"Proceedings of the 37th International Conference on Supercomputing","volume":"209 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-03-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129842976","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SPARTA: Spatial Acceleration for Efficient and Scalable Horizontal Diffusion Weather Stencil Computation
Gagandeep Singh, Alireza Khodamoradi, K. Denolf, Jack Lo, Juan Gómez-Luna, Joseph Melber, Andra Bisca, H. Corporaal, O. Mutlu
DOI: 10.1145/3577193.3593719
Fast and accurate climate simulations and weather predictions are critical for understanding and preparing for the impact of climate change. Real-world climate and weather simulations involve the use of complex compound stencil kernels, which are composed of a combination of different stencils. Horizontal diffusion is one such important compound stencil found in many climate and weather prediction models. Its computation involves a large amount of data access and manipulation that leads to two main issues on current computing systems. First, such compound stencils have high memory bandwidth demands as they require large amounts of data access. Second, compound stencils have complex data access patterns and poor data locality, as the memory access pattern is typically irregular with low arithmetic intensity. As a result, state-of-the-art CPU and GPU implementations suffer from limited performance and high energy consumption. Recent works propose using FPGAs as an alternative to traditional CPU and GPU-based systems to accelerate weather stencil kernels. However, we observe that stencil computation cannot leverage the bit-level flexibility available on an FPGA because of its complex memory access patterns, leading to high hardware resource utilization and low peak performance. We introduce SPARTA, a novel spatial accelerator for horizontal diffusion weather stencil computation. We exploit the two-dimensional spatial architecture to efficiently accelerate the horizontal diffusion stencil by designing the first scaled-out spatial accelerator using the MLIR (Multi-Level Intermediate Representation) compiler framework. We evaluate SPARTA on a real cutting-edge AMD-Xilinx Versal AI Engine (AIE) spatial architecture. Our real-system evaluation results demonstrate that SPARTA outperforms state-of-the-art CPU, GPU, and FPGA implementations by 17.1×, 1.2×, and 2.1×, respectively. Compared to the most energy-efficient design on an HBM-based FPGA, SPARTA provides 2.43× higher energy efficiency. Our results reveal that balancing workload across the available processing resources is crucial in achieving high performance on spatial architectures. We also implement and evaluate five elementary stencils that are commonly used as benchmarks for stencil computation research. We freely open-source all our implementations to aid future research in stencil computation and spatial computing systems at https://github.com/CMU-SAFARI/SPARTA.
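As a point of reference for what "compound stencil" means here, the NumPy sketch below implements a simplified, limiter-free horizontal-diffusion operator as a Laplacian applied to a Laplacian, so each output point indirectly touches a wide neighborhood. It is an expository CPU reference, not SPARTA's AIE implementation, and the coefficient is an arbitrary placeholder.

```python
import numpy as np

def laplacian(f):
    """Five-point Laplacian on the interior; boundary left at zero."""
    out = np.zeros_like(f)
    out[1:-1, 1:-1] = (f[2:, 1:-1] + f[:-2, 1:-1] +
                       f[1:-1, 2:] + f[1:-1, :-2] - 4.0 * f[1:-1, 1:-1])
    return out

def horizontal_diffusion(f, coeff=0.03):
    # Compound stencil: the diffusion update is derived from the
    # Laplacian of the Laplacian (fourth-order smoothing).
    return f - coeff * laplacian(laplacian(f))

field = np.random.rand(64, 64)
print(horizontal_diffusion(field).shape)
```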
{"title":"SPARTA: Spatial Acceleration for Efficient and Scalable Horizontal Diffusion Weather Stencil Computation","authors":"Gagandeep Singh, Alireza Khodamoradi, K. Denolf, Jack Lo, Juan G'omez-Luna, Joseph Melber, Andra Bisca, H. Corporaal, O. Mutlu","doi":"10.1145/3577193.3593719","DOIUrl":"https://doi.org/10.1145/3577193.3593719","url":null,"abstract":"Fast and accurate climate simulations and weather predictions are critical for understanding and preparing for the impact of climate change. Real-world climate and weather simulations involve the use of complex compound stencil kernels, which are composed of a combination of different stencils. Horizontal diffusion is one such important compound stencil found in many climate and weather prediction models. Its computation involves a large amount of data access and manipulation that leads to two main issues on current computing systems. First, such compound stencils have high memory bandwidth demands as they require large amounts of data access. Second, compound stencils have complex data access patterns and poor data locality, as the memory access pattern is typically irregular with low arithmetic intensity. As a result, state-of-the-art CPU and GPU implementations suffer from limited performance and high energy consumption. Recent works propose using FPGAs as an alternative to traditional CPU and GPU-based systems to accelerate weather stencil kernels. However, we observe that stencil computation cannot leverage the bit-level flexibility available on an FPGA because of its complex memory access patterns, leading to high hardware resource utilization and low peak performance. We introduce SPARTA, a novel spatial accelerator for horizontal diffusion weather stencil computation. We exploit the two-dimensional spatial architecture to efficiently accelerate the horizontal diffusion stencil by designing the first scaled-out spatial accelerator using the MLIR (Multi-Level Intermediate Representation) compiler framework. We evaluate SPARTA on a real cutting-edge AMD-Xilinx Versal AI Engine (AIE) spatial architecture. Our real-system evaluation results demonstrate that SPARTA outperforms state-of-the-art CPU, GPU, and FPGA implementations by 17.1×, 1.2×, and 2.1×, respectively. Compared to the most energy-efficient design on an HBM-based FPGA, SPARTA provides 2.43× higher energy efficiency. Our results reveal that balancing workload across the available processing resources is crucial in achieving high performance on spatial architectures. We also implement and evaluate five elementary stencils that are commonly used as benchmarks for stencil computation research. We freely open-source all our implementations to aid future research in stencil computation and spatial computing systems at https://github.com/CMU-SAFARI/SPARTA.","PeriodicalId":424155,"journal":{"name":"Proceedings of the 37th International Conference on Supercomputing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-03-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130570545","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CMLCompiler: A Unified Compiler for Classical Machine Learning
Xu Wen, Wanling Gao, An-Dong Li, Lei Wang, Zihan Jiang
DOI: 10.1145/3577193.3593710
Classical machine learning (CML) occupies nearly half of the machine learning pipelines in production applications. Unfortunately, it fails to fully utilize state-of-the-practice devices and performs poorly. Without a unified framework, hybrid deployments of deep learning (DL) and CML also suffer from severe performance and portability issues. This paper presents the design of a unified compiler, called CMLCompiler, for CML inference. We propose two unified abstractions: operator representations and extended computational graphs. The CMLCompiler framework performs conversion and graph optimization based on these two abstractions, then outputs an optimized computational graph to DL compilers or frameworks. We implement CMLCompiler on TVM. The evaluation shows CMLCompiler's portability and superior performance: it achieves up to 4.38× speedup on CPU, 3.31× speedup on GPU, and 5.09× speedup on IoT devices, compared to the state-of-the-art solutions scikit-learn, intel sklearn, and hummingbird. Our mixed CML and DL pipelines achieve up to 3.04× speedup over cross-framework implementations. The project documents and source code are available at https://www.computercouncil.org/cmlcompiler.
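The core conversion idea can be pictured as follows: a fitted classical model is re-expressed as a small graph of tensor operators that a DL compiler can then optimize. The sketch below is our illustration using scikit-learn's LogisticRegression, reducing prediction to a matmul-add-argmax graph; CMLCompiler's operator representations cover many more model types.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Extract the fitted parameters as plain tensors.
W, b = clf.coef_.T, clf.intercept_

def graph_predict(X):
    # The "compiled" model: a matmul -> add -> argmax computational graph.
    return np.argmax(X @ W + b, axis=1)

# The tensor-operator graph reproduces the library's predictions exactly,
# since sklearn's predict() takes the argmax of the same decision function.
assert (graph_predict(X) == clf.predict(X)).all()
```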
{"title":"CMLCompiler: A Unified Compiler for Classical Machine Learning","authors":"Xu Wen, Wanling Gao, An-Dong Li, Lei Wang, Zihan Jiang","doi":"10.1145/3577193.3593710","DOIUrl":"https://doi.org/10.1145/3577193.3593710","url":null,"abstract":"Classical machine learning (CML) occupies nearly half of machine learning pipelines in production applications. Unfortunately, it fails to utilize the state-of-the-practice devices fully and performs poorly. Without a unified framework, the hybrid deployments of deep learning (DL) and CML also suffer from severe performance and portability issues. This paper presents the design of a unified compiler, called CMLCompiler, for CML inference. We propose two unified abstractions: operator representations and extended computational graphs. The CMLCompiler framework performs the conversion and graph optimization based on two unified abstractions, then outputs an optimized computational graph to DL compilers or frameworks. We implement CMLCompiler on TVM. The evaluation shows CMLCompiler's portability and superior performance. It achieves up to 4.38× speedup on CPU, 3.31× speedup on GPU, and 5.09× speedup on IoT devices, compared to the state-of-the-art solutions --- scikit-learn, intel sklearn, and hummingbird. Our performance of CML and DL mixed pipelines achieves up to 3.04x speedup compared with cross-framework implementations. The project documents and source code are available at https://www.computercouncil.org/cmlcompiler.","PeriodicalId":424155,"journal":{"name":"Proceedings of the 37th International Conference on Supercomputing","volume":"160 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123413943","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Wafer-Scale Fast Fourier Transforms
Marcelo Orenes-Vera, I. Sharapov, R. Schreiber, M. Jacquelin, Philippe Vandermersch, Sharan Chetlur
DOI: 10.1145/3577193.3593708
We have implemented fast Fourier transforms for one-, two-, and three-dimensional arrays on the Cerebras CS-2, a system whose memory and processing elements reside on a single silicon wafer. The wafer-scale engine (WSE) encompasses a two-dimensional mesh of roughly 850,000 processing elements (PEs) with fast local memory and equally fast nearest-neighbor interconnections. Our wafer-scale FFT (wsFFT) parallelizes an n³ problem with up to n² PEs. At that level of parallelism, a PE processes only a single vector of the 3D domain (known as a pencil) per superstep, where each of the three supersteps performs the FFT along one of the three axes of the input array. Between supersteps, wsFFT redistributes (transposes) the data to bring all elements of each one-dimensional pencil being transformed into the memory of a single PE. Each redistribution causes an all-to-all communication along one of the mesh dimensions. Given the level of parallelism, the messages transmitted between pairs of PEs can be as small as a single word. In theory, a mesh is not ideal for all-to-all communication due to its limited bisection bandwidth. However, the mesh interconnecting PEs on the WSE lies entirely on-wafer and achieves nearly peak bandwidth even with tiny messages. We analyze computation and communication time in detail, as well as weak and strong scaling, using both FP16 and FP32 precision. With 32-bit arithmetic on the CS-2, we achieve 959 microseconds for the 3D FFT of a 512³ complex input array using a 512 × 512 subgrid of the on-wafer PEs. This is the largest parallelization ever achieved for this problem size and the first implementation to break the millisecond barrier.
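The superstep structure can be modeled in a few lines of NumPy: three 1-D FFT passes, each along a different axis, with a transpose standing in for the all-to-all redistribution between supersteps. This is a single-node model of wsFFT's data movement, not its on-wafer code.

```python
import numpy as np

def pencil_fft3d(x):
    """3D FFT as three axis-2 supersteps with redistributions in between."""
    x = np.fft.fft(x, axis=2)          # superstep 1: transform the k pencils
    x = x.transpose(0, 2, 1)           # redistribute: make j pencils contiguous
    x = np.fft.fft(x, axis=2)          # superstep 2: transform the j pencils
    x = x.transpose(2, 1, 0)           # redistribute: make i pencils contiguous
    x = np.fft.fft(x, axis=2)          # superstep 3: transform the i pencils
    return x.transpose(2, 0, 1)        # restore the (i, j, k) axis order

a = np.random.rand(8, 8, 8) + 1j * np.random.rand(8, 8, 8)
assert np.allclose(pencil_fft3d(a), np.fft.fftn(a))
```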
{"title":"Wafer-Scale Fast Fourier Transforms","authors":"Marcelo Orenes-Vera, I. Sharapov, R. Schreiber, M. Jacquelin, Philippe Vandermersch, Sharan Chetlur","doi":"10.1145/3577193.3593708","DOIUrl":"https://doi.org/10.1145/3577193.3593708","url":null,"abstract":"We have implemented fast Fourier transforms for one, two, and three-dimensional arrays on the Cerebras CS-2, a system whose memory and processing elements reside on a single silicon wafer. The wafer-scale engine (WSE) encompasses a two-dimensional mesh of roughly 850,000 processing elements (PEs) with fast local memory and equally fast nearest-neighbor interconnections. Our wafer-scale FFT (wsFFT) parallelizes a n3 problem with up to n2 PEs. At this point, a PE processes only a single vector of the 3D domain (known as a pencil) per superstep, where each of the three supersteps performs FFT along one of the three axes of the input array. Between supersteps, wsFFT redistributes (transposes) the data to bring all elements of each one-dimensional pencil being transformed into the memory of a single PE. Each redistribution causes an all-to-all communication along one of the mesh dimensions. Given the level of parallelism, the size of the messages transmitted between pairs of PEs can be as small as a single word. In theory, a mesh is not ideal for all-to-all communication due to its limited bisection bandwidth. However, the mesh interconnecting PEs on the WSE lies entirely on-wafer and achieves nearly peak bandwidth even with tiny messages. We analyze in detail computation and communication time, as well as the weak and strong scaling, using both FP16 and FP32 precision. With 32-bit arithmetic on the CS-2, we achieve 959 microseconds for 3D FFT of a 5123 complex input array using a 512 × 512 subgrid of the on-wafer PEs. This is the largest ever parallelization for this problem size and the first implementation that breaks the millisecond barrier.","PeriodicalId":424155,"journal":{"name":"Proceedings of the 37th International Conference on Supercomputing","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132223124","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
PERKS: a Locality-Optimized Execution Model for Iterative Memory-bound GPU Applications
Lingqi Zhang, M. Wahib, Peng Chen, Jintao Meng, Xiao Wang, Toshio Endo, S. Matsuoka
DOI: 10.1145/3577193.3593705
Iterative memory-bound solvers commonly occur in HPC codes. Typical GPU implementations have a loop on the host side that invokes the GPU kernel as many times as there are time/algorithm steps, and the termination of each kernel implicitly acts as the barrier required after advancing the solution at every time step. We propose an execution model for running memory-bound iterative GPU kernels: PERsistent KernelS (PERKS). In this model, the time loop is moved inside a persistent kernel, and device-wide barriers are used for synchronization. We then reduce the traffic to device memory by caching a subset of the output of each time step in the otherwise unused registers and shared memory. PERKS can be generalized to any iterative solver, as it is largely independent of the solver's implementation. We explain the design principles of PERKS and demonstrate its effectiveness for a wide range of iterative 2D/3D stencil benchmarks (geomean speedup of 2.12× for 2D stencils and 1.24× for 3D stencils over state-of-the-art libraries) and for a Krylov subspace conjugate gradient solver (geomean speedup of 4.86× on smaller SpMV datasets from SuiteSparse and 1.43× on larger SpMV datasets over a state-of-the-art library). All PERKS-based implementations are available at: https://github.com/neozhang307/PERKS.
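The restructuring can be modeled on the host. In the sketch below (our Python model, with threads standing in for thread blocks and threading.Barrier for the device-wide barrier), the time loop lives inside the long-lived "kernel" instead of around repeated kernel launches, so per-step launch overhead disappears and per-worker state can stay resident.

```python
import threading
import numpy as np

def run_perks(grid, steps, n_workers=4):
    """Jacobi-style 1D smoothing with the time loop inside persistent workers."""
    barrier = threading.Barrier(n_workers)
    chunks = np.array_split(np.arange(1, grid.shape[0] - 1), n_workers)
    bufs = [grid.copy(), grid.copy()]              # double buffer

    def persistent_kernel(rows):
        for t in range(steps):                     # time loop lives in the kernel
            src, dst = bufs[t % 2], bufs[(t + 1) % 2]
            dst[rows] = 0.5 * src[rows] + 0.25 * (src[rows - 1] + src[rows + 1])
            barrier.wait()                         # stands in for grid-wide sync

    threads = [threading.Thread(target=persistent_kernel, args=(c,)) for c in chunks]
    for th in threads: th.start()
    for th in threads: th.join()
    return bufs[steps % 2]

print(run_perks(np.random.rand(32), steps=10).round(3))
```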
{"title":"PERKS: a Locality-Optimized Execution Model for Iterative Memory-bound GPU Applications","authors":"Lingqi Zhang, M. Wahib, Peng Chen, Jintao Meng, Xiao Wang, Toshio Endo, S. Matsuoka","doi":"10.1145/3577193.3593705","DOIUrl":"https://doi.org/10.1145/3577193.3593705","url":null,"abstract":"Iterative memory-bound solvers commonly occur in HPC codes. Typical GPU implementations have a loop on the host side that invokes the GPU kernel as much as time/algorithm steps there are. The termination of each kernel implicitly acts the barrier required after advancing the solution every time step. We propose an execution model for running memory-bound iterative GPU kernels: PERsistent KernelS (PERKS). In this model, the time loop is moved inside persistent kernel, and device-wide barriers are used for synchronization. We then reduce the traffic to device memory by caching subset of the output in each time step in the unused registers and shared memory. PERKS can be generalized to any iterative solver: they largely independent of the solver's implementation. We explain the design principle of PERKS and demonstrate effectiveness of PERKS for a wide range of iterative 2D/3D stencil benchmarks (geomean speedup of 2.12x for 2D stencils and 1.24x for 3D stencils over state-of-art libraries), and a Krylov subspace conjugate gradient solver (geomean speedup of 4.86x in smaller SpMV datasets from SuiteSparse and 1.43x in larger SpMV datasets over a state-of-art library). All PERKS-based implementations available at: https://github.com/neozhang307/PERKS.","PeriodicalId":424155,"journal":{"name":"Proceedings of the 37th International Conference on Supercomputing","volume":"67 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133181574","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proceedings of the 37th International Conference on Supercomputing","authors":"","doi":"10.1145/3577193","DOIUrl":"https://doi.org/10.1145/3577193","url":null,"abstract":"","PeriodicalId":424155,"journal":{"name":"Proceedings of the 37th International Conference on Supercomputing","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131777787","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}