BitGNN: Unleashing the Performance Potential of Binary Graph Neural Networks on GPUs
Jou-An Chen, Hsin-Hsuan Sung, Xipeng Shen, Sutanay Choudhury, Ang Li
DOI: 10.1145/3577193.3593725
Recent studies have shown that binary Graph Neural Networks (GNNs) are promising for reducing the computation of GNNs through binarized tensors. Prior work, however, mainly focused on algorithm design or training techniques, leaving open how to fully realize the performance potential on accelerator hardware. This work redesigns the binary GNN inference backend from the efficiency perspective. It fills the gap with a series of abstractions and techniques that map binary GNNs and their computations to best fit the nature of bit manipulations on GPUs. Results on real-world graphs with GCNs, GraphSAGE, and GraphSAINT show that the proposed techniques outperform state-of-the-art binary GNN implementations by 8-22× while maintaining the same accuracy. The BitGNN code is publicly available.
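To make the bit-level idea concrete, here is a minimal NumPy sketch (our illustration, not BitGNN's code) of the XNOR-popcount primitive that binary-tensor backends map onto GPU bit operations: ±1 values are packed into machine words, and a dot product becomes a bitwise XNOR followed by a population count.

```python
import numpy as np

def pack_pm1(x):
    """Pack a +/-1 vector into uint8 words (bit 1 encodes +1)."""
    return np.packbits(x > 0)

def binary_dot(a_bits, b_bits, n):
    """dot(a, b) for +/-1 vectors of length n, from their packed bits."""
    xnor = ~(a_bits ^ b_bits)                # bit is 1 where the inputs agree
    # Count agreeing bits; count=n drops the padding added by packbits.
    matches = int(np.unpackbits(xnor, count=n).sum())
    return 2 * matches - n                   # agreements minus disagreements

rng = np.random.default_rng(0)
n = 37
a = rng.choice([-1, 1], size=n)
b = rng.choice([-1, 1], size=n)
assert binary_dot(pack_pm1(a), pack_pm1(b), n) == int(a @ b)
```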
{"title":"BitGNN: Unleashing the Performance Potential of Binary Graph Neural Networks on GPUs","authors":"Jou-An Chen, Hsin-Hsuan Sung, Xipeng Shen, Sutanay Choudhury, Ang Li","doi":"10.1145/3577193.3593725","DOIUrl":"https://doi.org/10.1145/3577193.3593725","url":null,"abstract":"Recent studies have shown that Binary Graph Neural Networks (GNNs) are promising for saving computations of GNNs through binarized tensors. Prior work, however, mainly focused on algorithm designs or training techniques, leaving it open to how to materialize the performance potential on accelerator hardware fully. This work redesigns the binary GNN inference backend from the efficiency perspective. It fills the gap by proposing a series of abstractions and techniques to map binary GNNs and their computations best to fit the nature of bit manipulations on GPUs. Results on real-world graphs with GCNs, GraphSAGE, and GraphSAINT show that the proposed techniques outperform state-of-the-art binary GNN implementations by 8-22X with the same accuracy maintained. BitGNN code is publicly available.1.","PeriodicalId":424155,"journal":{"name":"Proceedings of the 37th International Conference on Supercomputing","volume":"266 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122911368","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Anatomy of High-Performance GEMM with Online Fault Tolerance on GPUs
Shixun Wu, Yujia Zhai, Jinyang Liu, Jiajun Huang, Zizhe Jian, Bryan M. Wong, Zizhong Chen
DOI: 10.1145/3577193.3593715
General Matrix Multiplication (GEMM) is a crucial algorithm for applications such as machine learning and scientific computing, since an efficient GEMM implementation is essential to the performance of these calculations. While researchers often strive for faster performance on large computing platforms, the increased scale of these systems raises concerns about hardware and software reliability. In this paper, we present the design of a high-performance GPU-based GEMM that integrates an algorithm-based fault tolerance scheme to detect and correct silent data corruptions at the computing units on the fly. We explore fault-tolerant designs for GEMM at the thread, warp, and threadblock levels, and also provide a baseline GEMM implementation that is competitive with or faster than the state-of-the-art, closed-source cuBLAS GEMM. We present a kernel fusion strategy that overlaps the memory latency due to fault tolerance with the original GEMM computation. To support a wide range of input matrix shapes and reduce development costs, we present a template-based approach for automatically generating both fault-tolerant and non-fault-tolerant GEMM implementations. We evaluate our work on NVIDIA Tesla T4 and A100 server GPUs. Our experimental results demonstrate that our baseline GEMM achieves comparable or superior performance to the closed-source cuBLAS. Compared with the prior state-of-the-art non-fused fault-tolerant GEMM, our optimal fused strategy achieves a 39.04% speedup on average. In addition, our fault-tolerant GEMM incurs only minimal overhead (8.89% on average) relative to cuBLAS, even with hundreds of errors injected per minute. For irregularly shaped inputs, the kernels produced by the code generator achieve remarkable speedups of 160%-183.5% and 148.55%-165.12% for fault-tolerant and non-fault-tolerant GEMMs, respectively, outperforming cuBLAS by up to 41.40%.
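The "detect and correct on-the-fly" mechanism builds on classic algorithm-based fault tolerance (ABFT), in which checksum rows and columns are carried through the multiplication. Below is a minimal NumPy sketch of that underlying idea, not the paper's fused GPU kernels; the function names are ours.

```python
import numpy as np

def abft_gemm(A, B):
    """Multiply with an extra column-checksum row and row-checksum column."""
    Ac = np.vstack([A, A.sum(axis=0)])                  # (m+1) x k
    Br = np.hstack([B, B.sum(axis=1, keepdims=True)])   # k x (n+1)
    return C := Ac @ Br if False else Ac @ Br           # (m+1) x (n+1)

def check_and_correct(C, tol=1e-8):
    """Locate and repair a single corrupted element of the (m x n) body."""
    body = C[:-1, :-1]
    row_err = np.abs(body.sum(axis=1) - C[:-1, -1])     # row-checksum residuals
    col_err = np.abs(body.sum(axis=0) - C[-1, :-1])     # column-checksum residuals
    bad_rows = np.where(row_err > tol)[0]
    bad_cols = np.where(col_err > tol)[0]
    if len(bad_rows) == 1 and len(bad_cols) == 1:       # one inconsistent cross
        i, j = bad_rows[0], bad_cols[0]
        # True value = row checksum minus the rest of the row.
        body[i, j] = C[i, -1] - (body[i, :].sum() - body[i, j])
    return body

A, B = np.random.rand(4, 5), np.random.rand(5, 3)
C = abft_gemm(A, B)
C[1, 2] += 7.0                                          # inject a silent corruption
assert np.allclose(check_and_correct(C), A @ B)
```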
{"title":"Anatomy of High-Performance GEMM with Online Fault Tolerance on GPUs","authors":"Shixun Wu, Yujia Zhai, Jinyang Liu, Jiajun Huang, Zizhe Jian, Bryan M. Wong, Zizhong Chen","doi":"10.1145/3577193.3593715","DOIUrl":"https://doi.org/10.1145/3577193.3593715","url":null,"abstract":"General Matrix Multiplication (GEMM) is a crucial algorithm for various applications such as machine learning and scientific computing since an efficient GEMM implementation is essential for the performance of these calculations. While researchers often strive for faster performance by using large computing platforms, the increased scale of these systems can raise concerns about hardware and software reliability. In this paper, we present a design of a high-performance GPU-based GEMM that integrates an algorithm-based fault tolerance scheme that detects and corrects silent data corruptions at computing units on-the-fly. We explore fault-tolerant designs for GEMM at the thread, warp, and threadblock levels, and also provide a baseline GEMM implementation that is competitive with or faster than the state-of-the-art, closed-source cuBLAS GEMM. We present a kernel fusion strategy to overlap and mitigate the memory latency due to fault tolerance with the original GEMM computation. To support a wide range of input matrix shapes and reduce development costs, we present a template-based approach for automatic code generation for both fault-tolerant and non-fault-tolerant GEMM implementations. We evaluate our work on NVIDIA Tesla T4 and A100 server GPUs. Our experimental results demonstrate that our baseline GEMM shows comparable or superior performance compared to the closed-source cuBLAS. Compared with the prior state-of-the-art non-fused fault-tolerant GEMM, our optimal fused strategy achieves a 39.04% speedup on average. In addition, our fault-tolerant GEMM incurs only a minimal overhead (8.89% on average) compared to cuBLAS even with hundreds of errors injected per minute. For irregularly shaped inputs, the code generator-generated kernels show remarkable speedups of 160% ~ 183.5% and 148.55% ~ 165.12% for fault-tolerant and non-fault-tolerant GEMMs, respectively, which outperforms cuBLAS by up to 41.40%.","PeriodicalId":424155,"journal":{"name":"Proceedings of the 37th International Conference on Supercomputing","volume":"118 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116360146","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GPULZ: Optimizing LZSS Lossless Compression for Multi-byte Data on Modern GPUs
Bo Zhang, Jiannan Tian, S. Di, Xiaodong Yu, M. Swany, Dingwen Tao, F. Cappello
DOI: 10.1145/3577193.3593706
Today's graphics processing unit (GPU) applications produce vast volumes of data, which are challenging to store and transfer efficiently. Thus, data compression is becoming a critical technique for mitigating the storage burden and communication cost. LZSS is the core algorithm in many widely used compressors, such as Deflate. However, existing GPU-based LZSS compressors suffer from low throughput due to the sequential nature of the LZSS algorithm. Moreover, many GPU applications produce multi-byte data (e.g., int16/int32 indices, floating-point numbers), while current LZSS compression takes only single-byte data as input. To this end, we propose gpuLZ, a highly efficient LZSS compressor for multi-byte data on modern GPUs. The contribution of our work is fourfold. First, we perform an in-depth analysis of existing LZ compressors for GPUs and investigate their main issues. Second, we propose two main algorithm-level optimizations: we (1) change the prefix sum from one pass to two passes and fuse multiple kernels to reduce data movement between shared memory and global memory, and (2) optimize the existing pattern-matching approach for multi-byte symbols to reduce computational complexity and discover longer repeated patterns. Third, we perform architectural performance optimizations, such as maximizing shared memory utilization by adapting data partitions to different GPU architectures. Finally, we evaluate gpuLZ on six datasets of various types with NVIDIA A100 and A4000 GPUs. Results show that gpuLZ achieves up to 272.1× speedup on A4000 and up to 1.4× higher compression ratio compared to state-of-the-art solutions.
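For readers unfamiliar with LZSS, the toy sequential encoder/decoder below (our sketch, operating on generic symbols rather than single bytes) illustrates the symbol-granularity matching that gpuLZ parallelizes; the actual GPU algorithm with its two-pass prefix sums is far more involved.

```python
def lzss_encode(data, window=64, min_len=2):
    """Greedy LZSS over a list of symbols (each symbol may be multi-byte)."""
    out, i = [], 0
    while i < len(data):
        best_off, best_len = 0, 0
        for s in range(max(0, i - window), i):       # scan the sliding window
            l = 0
            while i + l < len(data) and data[s + l] == data[i + l]:
                l += 1                               # extend the match (may overlap i)
            if l > best_len:
                best_off, best_len = i - s, l
        if best_len >= min_len:
            out.append(("match", best_off, best_len))  # back-reference token
            i += best_len
        else:
            out.append(("lit", data[i]))               # literal symbol
            i += 1
    return out

def lzss_decode(tokens):
    out = []
    for t in tokens:
        if t[0] == "lit":
            out.append(t[1])
        else:
            _, off, length = t
            for _ in range(length):                  # symbol-by-symbol copy
                out.append(out[-off])                # handles overlapping matches
    return out

data = [3, 7, 7, 3, 7, 7, 3, 7, 7, 9]                # e.g., an int32 symbol stream
assert lzss_decode(lzss_encode(data)) == data
```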
{"title":"GPULZ: Optimizing LZSS Lossless Compression for Multi-byte Data on Modern GPUs","authors":"Bo Zhang, Jiannan Tian, S. Di, Xiaodong Yu, M. Swany, Dingwen Tao, F. Cappello","doi":"10.1145/3577193.3593706","DOIUrl":"https://doi.org/10.1145/3577193.3593706","url":null,"abstract":"Today's graphics processing unit (GPU) applications produce vast volumes of data, which are challenging to store and transfer efficiently. Thus, data compression is becoming a critical technique to mitigate the storage burden and communication cost. LZSS is the core algorithm in many widely used compressors, such as Deflate. However, existing GPU-based LZSS compressors suffer from low throughput due to the sequential nature of the LZSS algorithm. Moreover, many GPU applications produce multi-byte data (e.g., int16/int32 index, floating-point numbers), while the current LZSS compression only takes single-byte data as input. To this end, in this work, we propose gpuLZ, a highly efficient LZSS compression on modern GPUs for multi-byte data. The contribution of our work is fourfold: First, we perform an in-depth analysis of existing LZ compressors for GPUs and investigate their main issues. Then, we propose two main algorithm-level optimizations. Specifically, we (1) change prefix sum from one pass to two passes and fuse multiple kernels to reduce data movement between shared memory and global memory, and (2) optimize existing pattern-matching approach for multi-byte symbols to reduce computation complexity and explore longer repeated patterns. Third, we perform architectural performance optimizations, such as maximizing shared memory utilization by adapting data partitions to different GPU architectures. Finally, we evaluate gpuLZ on six datasets of various types with NVIDIA A100 and A4000 GPUs. Results show that gpuLZ achieves up to 272.1× speedup on A4000 and up to 1.4× higher compression ratio compared to state-of-the-art solutions.","PeriodicalId":424155,"journal":{"name":"Proceedings of the 37th International Conference on Supercomputing","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124779881","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
HEAT: A Highly Efficient and Affordable Training System for Collaborative Filtering Based Recommendation on CPUs
Chengming Zhang, Shaden Smith, Baixi Sun, Jiannan Tian, Jon Soifer, Xiaodong Yu, S. Song, Yuxiong He, Dingwen Tao
DOI: 10.1145/3577193.3593717
Collaborative filtering (CF) has been proven to be one of the most effective techniques for recommendation. Among all CF approaches, SimpleX is the state-of-the-art method, adopting a novel loss function and a proper number of negative samples. However, no prior work optimizes SimpleX on multi-core CPUs, leading to limited performance. To this end, we perform an in-depth profiling and analysis of existing SimpleX implementations and identify their performance bottlenecks: (1) irregular memory accesses, (2) unnecessary memory copies, and (3) redundant computations. To address these issues, we propose an efficient CF training system (called HEAT) that fully exploits the multi-level caching and multi-threading capabilities of modern CPUs. Specifically, the optimization of HEAT is threefold: (1) it tiles the embedding matrix to increase data locality and reduce cache misses (and thus read latency); (2) it optimizes stochastic gradient descent (SGD) with sampling by parallelizing vector products instead of matrix-matrix multiplications, in particular the similarity computation therein, to avoid memory copies for matrix data preparation; and (3) it aggressively reuses intermediate results from the forward phase in the backward phase to alleviate redundant computation. Evaluation on five widely used datasets with both x86- and ARM-architecture processors shows that HEAT achieves up to 45.2× speedup over the existing CPU solution, and 4.5× speedup and 7.9× cost reduction in the cloud over the existing GPU solution on an NVIDIA V100 GPU.
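As a rough illustration of optimization (2), the NumPy sketch below computes one user's similarities against a positive item and a handful of sampled negatives as individual vector dot products rather than a materialized matrix-matrix product. The margin-free cosine contrastive loss is our simplifying assumption, not HEAT's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_items, n_neg = 16, 1000, 8
U = rng.normal(size=d)                        # one user's embedding
V = rng.normal(size=(n_items, d))             # item embedding table

def cos(u, v):
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

pos = 42                                      # observed (positive) item
negs = rng.choice(n_items, size=n_neg, replace=False)   # sampled negatives

# Similarities via per-item vector products: one row of V at a time,
# which keeps accesses to a tiled embedding table cache-friendly.
pos_sim = cos(U, V[pos])
neg_sims = [cos(U, V[j]) for j in negs]
loss = (1.0 - pos_sim) + np.mean([max(0.0, s) for s in neg_sims])
print(f"pos_sim={pos_sim:.3f}  loss={loss:.3f}")
```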
{"title":"HEAT: A Highly Efficient and Affordable Training System for Collaborative Filtering Based Recommendation on CPUs","authors":"Chengming Zhang, Shaden Smith, Baixi Sun, Jiannan Tian, Jon Soifer, Xiaodong Yu, S. Song, Yuxiong He, Dingwen Tao","doi":"10.1145/3577193.3593717","DOIUrl":"https://doi.org/10.1145/3577193.3593717","url":null,"abstract":"Collaborative filtering (CF) has been proven to be one of the most effective techniques for recommendation. Among all CF approaches, SimpleX is the state-of-the-art method that adopts a novel loss function and a proper number of negative samples. However, there is no work that optimizes SimpleX on multi-core CPUs, leading to limited performance. To this end, we perform an in-depth profiling and analysis of existing SimpleX implementations and identify their performance bottlenecks including (1) irregular memory accesses, (2) unnecessary memory copies, and (3) redundant computations. To address these issues, we propose an efficient CF training system (called HEAT) that fully enables the multi-level caching and multi-threading capabilities of modern CPUs. Specifically, the optimization of HEAT is threefold: (1) It tiles the embedding matrix to increase data locality and reduce cache misses (thus reduces read latency); (2) It optimizes stochastic gradient descent (SGD) with sampling by parallelizing vector products instead of matrix-matrix multiplications, in particular the similarity computation therein, to avoid memory copies for matrix data preparation; and (3) It aggressively reuses intermediate results from the forward phase in the backward phase to alleviate redundant computation. Evaluation on five widely used datasets with both x86- and ARM-architecture processors shows that HEAT achieves up to 45.2× speedup over existing CPU solution and 4.5× speedup and 7.9× cost reduction in Cloud over existing GPU solution with NVIDIA V100 GPU.","PeriodicalId":424155,"journal":{"name":"Proceedings of the 37th International Conference on Supercomputing","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116969040","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training
Siddharth Singh, Olatunji Ruwase, A. Awan, Samyam Rajbhandari, Yuxiong He, A. Bhatele
DOI: 10.1145/3577193.3593704
Mixture-of-Experts (MoE) is a neural network architecture that adds sparsely activated expert blocks to a base model, increasing the number of parameters without impacting computational costs. However, current distributed deep learning frameworks are limited in their ability to train high-quality MoE models with large base models. In this work, we present DeepSpeed-TED, a novel, three-dimensional, hybrid parallel algorithm that combines data, tensor, and expert parallelism to enable the training of MoE models with 4--8× larger base models than the current state-of-the-art. We also describe memory optimizations in the optimizer step, and communication optimizations that eliminate unnecessary data movement. We implement our approach in DeepSpeed and achieve speedups of 26% over a baseline (i.e. without our communication optimizations) when training a 40 billion parameter MoE model (6.7 billion base model with 16 experts) on 128 V100 GPUs.
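To see how a three-dimensional decomposition assigns work, the sketch below maps a flat GPU rank to (data, tensor, expert) coordinates. The dimension ordering is our assumption for illustration and need not match DeepSpeed-TED's actual process-group layout.

```python
def rank_to_coords(rank, tensor_size, expert_size):
    """Decompose a flat rank into (data, expert, tensor) group coordinates."""
    tensor_id = rank % tensor_size                     # fastest-varying dimension
    expert_id = (rank // tensor_size) % expert_size
    data_id = rank // (tensor_size * expert_size)      # slowest-varying dimension
    return data_id, expert_id, tensor_id

world, tensor_size, expert_size = 16, 4, 2             # 16 GPUs = 2 data x 2 expert x 4 tensor
for r in range(world):
    d, e, t = rank_to_coords(r, tensor_size, expert_size)
    print(f"rank {r:2d} -> data {d}, expert {e}, tensor {t}")
```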
{"title":"A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training","authors":"Siddharth Singh, Olatunji Ruwase, A. Awan, Samyam Rajbhandari, Yuxiong He, A. Bhatele","doi":"10.1145/3577193.3593704","DOIUrl":"https://doi.org/10.1145/3577193.3593704","url":null,"abstract":"Mixture-of-Experts (MoE) is a neural network architecture that adds sparsely activated expert blocks to a base model, increasing the number of parameters without impacting computational costs. However, current distributed deep learning frameworks are limited in their ability to train high-quality MoE models with large base models. In this work, we present DeepSpeed-TED, a novel, three-dimensional, hybrid parallel algorithm that combines data, tensor, and expert parallelism to enable the training of MoE models with 4--8× larger base models than the current state-of-the-art. We also describe memory optimizations in the optimizer step, and communication optimizations that eliminate unnecessary data movement. We implement our approach in DeepSpeed and achieve speedups of 26% over a baseline (i.e. without our communication optimizations) when training a 40 billion parameter MoE model (6.7 billion base model with 16 experts) on 128 V100 GPUs.","PeriodicalId":424155,"journal":{"name":"Proceedings of the 37th International Conference on Supercomputing","volume":"209 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-03-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129842976","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SPARTA: Spatial Acceleration for Efficient and Scalable Horizontal Diffusion Weather Stencil Computation
Gagandeep Singh, Alireza Khodamoradi, K. Denolf, Jack Lo, Juan Gómez-Luna, Joseph Melber, Andra Bisca, H. Corporaal, O. Mutlu
DOI: 10.1145/3577193.3593719
Fast and accurate climate simulations and weather predictions are critical for understanding and preparing for the impact of climate change. Real-world climate and weather simulations involve the use of complex compound stencil kernels, which are composed of a combination of different stencils. Horizontal diffusion is one such important compound stencil found in many climate and weather prediction models. Its computation involves a large amount of data access and manipulation that leads to two main issues on current computing systems. First, such compound stencils have high memory bandwidth demands as they require large amounts of data access. Second, compound stencils have complex data access patterns and poor data locality, as the memory access pattern is typically irregular with low arithmetic intensity. As a result, state-of-the-art CPU and GPU implementations suffer from limited performance and high energy consumption. Recent works propose using FPGAs as an alternative to traditional CPU and GPU-based systems to accelerate weather stencil kernels. However, we observe that stencil computation cannot leverage the bit-level flexibility available on an FPGA because of its complex memory access patterns, leading to high hardware resource utilization and low peak performance. We introduce SPARTA, a novel spatial accelerator for horizontal diffusion weather stencil computation. We exploit the two-dimensional spatial architecture to efficiently accelerate the horizontal diffusion stencil by designing the first scaled-out spatial accelerator using the MLIR (Multi-Level Intermediate Representation) compiler framework. We evaluate SPARTA on a real cutting-edge AMD-Xilinx Versal AI Engine (AIE) spatial architecture. Our real-system evaluation results demonstrate that SPARTA outperforms state-of-the-art CPU, GPU, and FPGA implementations by 17.1×, 1.2×, and 2.1×, respectively. Compared to the most energy-efficient design on an HBM-based FPGA, SPARTA provides 2.43× higher energy efficiency. Our results reveal that balancing workload across the available processing resources is crucial in achieving high performance on spatial architectures. We also implement and evaluate five elementary stencils that are commonly used as benchmarks for stencil computation research. We freely open-source all our implementations to aid future research in stencil computation and spatial computing systems at https://github.com/CMU-SAFARI/SPARTA.
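As a point of reference for what "compound stencil" means here, the NumPy sketch below implements a simplified, limiter-free horizontal-diffusion operator as a Laplacian applied to a Laplacian, so each output point indirectly touches a wide neighborhood. It is an expository CPU reference, not SPARTA's AIE implementation, and the coefficient is an arbitrary placeholder.

```python
import numpy as np

def laplacian(f):
    """Five-point Laplacian on the interior; boundary left at zero."""
    out = np.zeros_like(f)
    out[1:-1, 1:-1] = (f[2:, 1:-1] + f[:-2, 1:-1] +
                       f[1:-1, 2:] + f[1:-1, :-2] - 4.0 * f[1:-1, 1:-1])
    return out

def horizontal_diffusion(f, coeff=0.03):
    # Compound stencil: the diffusion update is derived from the
    # Laplacian of the Laplacian (fourth-order smoothing).
    return f - coeff * laplacian(laplacian(f))

field = np.random.rand(64, 64)
print(horizontal_diffusion(field).shape)
```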
{"title":"SPARTA: Spatial Acceleration for Efficient and Scalable Horizontal Diffusion Weather Stencil Computation","authors":"Gagandeep Singh, Alireza Khodamoradi, K. Denolf, Jack Lo, Juan G'omez-Luna, Joseph Melber, Andra Bisca, H. Corporaal, O. Mutlu","doi":"10.1145/3577193.3593719","DOIUrl":"https://doi.org/10.1145/3577193.3593719","url":null,"abstract":"Fast and accurate climate simulations and weather predictions are critical for understanding and preparing for the impact of climate change. Real-world climate and weather simulations involve the use of complex compound stencil kernels, which are composed of a combination of different stencils. Horizontal diffusion is one such important compound stencil found in many climate and weather prediction models. Its computation involves a large amount of data access and manipulation that leads to two main issues on current computing systems. First, such compound stencils have high memory bandwidth demands as they require large amounts of data access. Second, compound stencils have complex data access patterns and poor data locality, as the memory access pattern is typically irregular with low arithmetic intensity. As a result, state-of-the-art CPU and GPU implementations suffer from limited performance and high energy consumption. Recent works propose using FPGAs as an alternative to traditional CPU and GPU-based systems to accelerate weather stencil kernels. However, we observe that stencil computation cannot leverage the bit-level flexibility available on an FPGA because of its complex memory access patterns, leading to high hardware resource utilization and low peak performance. We introduce SPARTA, a novel spatial accelerator for horizontal diffusion weather stencil computation. We exploit the two-dimensional spatial architecture to efficiently accelerate the horizontal diffusion stencil by designing the first scaled-out spatial accelerator using the MLIR (Multi-Level Intermediate Representation) compiler framework. We evaluate SPARTA on a real cutting-edge AMD-Xilinx Versal AI Engine (AIE) spatial architecture. Our real-system evaluation results demonstrate that SPARTA outperforms state-of-the-art CPU, GPU, and FPGA implementations by 17.1×, 1.2×, and 2.1×, respectively. Compared to the most energy-efficient design on an HBM-based FPGA, SPARTA provides 2.43× higher energy efficiency. Our results reveal that balancing workload across the available processing resources is crucial in achieving high performance on spatial architectures. We also implement and evaluate five elementary stencils that are commonly used as benchmarks for stencil computation research. We freely open-source all our implementations to aid future research in stencil computation and spatial computing systems at https://github.com/CMU-SAFARI/SPARTA.","PeriodicalId":424155,"journal":{"name":"Proceedings of the 37th International Conference on Supercomputing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-03-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130570545","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CMLCompiler: A Unified Compiler for Classical Machine Learning
Xu Wen, Wanling Gao, An-Dong Li, Lei Wang, Zihan Jiang
DOI: 10.1145/3577193.3593710
Classical machine learning (CML) occupies nearly half of the machine learning pipelines in production applications. Unfortunately, it fails to fully utilize state-of-the-practice devices and performs poorly. Without a unified framework, hybrid deployments of deep learning (DL) and CML also suffer from severe performance and portability issues. This paper presents the design of a unified compiler, called CMLCompiler, for CML inference. We propose two unified abstractions: operator representations and extended computational graphs. The CMLCompiler framework performs conversion and graph optimization based on these two abstractions, then outputs an optimized computational graph to DL compilers or frameworks. We implement CMLCompiler on TVM. The evaluation shows CMLCompiler's portability and superior performance: it achieves up to 4.38× speedup on CPU, 3.31× speedup on GPU, and 5.09× speedup on IoT devices, compared to the state-of-the-art solutions scikit-learn, intel sklearn, and hummingbird. Our mixed CML and DL pipelines achieve up to 3.04× speedup over cross-framework implementations. The project documents and source code are available at https://www.computercouncil.org/cmlcompiler.
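The core conversion idea can be pictured as follows: a fitted classical model is re-expressed as a small graph of tensor operators that a DL compiler can then optimize. The sketch below is our illustration using scikit-learn's LogisticRegression, reducing prediction to a matmul-add-argmax graph; CMLCompiler's operator representations cover many more model types.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Extract the fitted parameters as plain tensors.
W, b = clf.coef_.T, clf.intercept_

def graph_predict(X):
    # The "compiled" model: a matmul -> add -> argmax computational graph.
    return np.argmax(X @ W + b, axis=1)

# The tensor-operator graph reproduces the library's predictions exactly,
# since sklearn's predict() takes the argmax of the same decision function.
assert (graph_predict(X) == clf.predict(X)).all()
```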
{"title":"CMLCompiler: A Unified Compiler for Classical Machine Learning","authors":"Xu Wen, Wanling Gao, An-Dong Li, Lei Wang, Zihan Jiang","doi":"10.1145/3577193.3593710","DOIUrl":"https://doi.org/10.1145/3577193.3593710","url":null,"abstract":"Classical machine learning (CML) occupies nearly half of machine learning pipelines in production applications. Unfortunately, it fails to utilize the state-of-the-practice devices fully and performs poorly. Without a unified framework, the hybrid deployments of deep learning (DL) and CML also suffer from severe performance and portability issues. This paper presents the design of a unified compiler, called CMLCompiler, for CML inference. We propose two unified abstractions: operator representations and extended computational graphs. The CMLCompiler framework performs the conversion and graph optimization based on two unified abstractions, then outputs an optimized computational graph to DL compilers or frameworks. We implement CMLCompiler on TVM. The evaluation shows CMLCompiler's portability and superior performance. It achieves up to 4.38× speedup on CPU, 3.31× speedup on GPU, and 5.09× speedup on IoT devices, compared to the state-of-the-art solutions --- scikit-learn, intel sklearn, and hummingbird. Our performance of CML and DL mixed pipelines achieves up to 3.04x speedup compared with cross-framework implementations. The project documents and source code are available at https://www.computercouncil.org/cmlcompiler.","PeriodicalId":424155,"journal":{"name":"Proceedings of the 37th International Conference on Supercomputing","volume":"160 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123413943","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Wafer-Scale Fast Fourier Transforms
Marcelo Orenes-Vera, I. Sharapov, R. Schreiber, M. Jacquelin, Philippe Vandermersch, Sharan Chetlur
DOI: 10.1145/3577193.3593708
We have implemented fast Fourier transforms for one-, two-, and three-dimensional arrays on the Cerebras CS-2, a system whose memory and processing elements reside on a single silicon wafer. The wafer-scale engine (WSE) encompasses a two-dimensional mesh of roughly 850,000 processing elements (PEs) with fast local memory and equally fast nearest-neighbor interconnections. Our wafer-scale FFT (wsFFT) parallelizes an n³ problem with up to n² PEs. At that level of parallelism, a PE processes only a single vector of the 3D domain (known as a pencil) per superstep, where each of the three supersteps performs the FFT along one of the three axes of the input array. Between supersteps, wsFFT redistributes (transposes) the data to bring all elements of each one-dimensional pencil being transformed into the memory of a single PE. Each redistribution causes an all-to-all communication along one of the mesh dimensions. Given the level of parallelism, the messages transmitted between pairs of PEs can be as small as a single word. In theory, a mesh is not ideal for all-to-all communication due to its limited bisection bandwidth. However, the mesh interconnecting PEs on the WSE lies entirely on-wafer and achieves nearly peak bandwidth even with tiny messages. We analyze computation and communication time in detail, as well as weak and strong scaling, using both FP16 and FP32 precision. With 32-bit arithmetic on the CS-2, we achieve 959 microseconds for the 3D FFT of a 512³ complex input array using a 512 × 512 subgrid of the on-wafer PEs. This is the largest parallelization ever achieved for this problem size and the first implementation to break the millisecond barrier.
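The superstep structure can be modeled in a few lines of NumPy: three 1-D FFT passes, each along a different axis, with a transpose standing in for the all-to-all redistribution between supersteps. This is a single-node model of wsFFT's data movement, not its on-wafer code.

```python
import numpy as np

def pencil_fft3d(x):
    """3D FFT as three axis-2 supersteps with redistributions in between."""
    x = np.fft.fft(x, axis=2)          # superstep 1: transform the k pencils
    x = x.transpose(0, 2, 1)           # redistribute: make j pencils contiguous
    x = np.fft.fft(x, axis=2)          # superstep 2: transform the j pencils
    x = x.transpose(2, 1, 0)           # redistribute: make i pencils contiguous
    x = np.fft.fft(x, axis=2)          # superstep 3: transform the i pencils
    return x.transpose(2, 0, 1)        # restore the (i, j, k) axis order

a = np.random.rand(8, 8, 8) + 1j * np.random.rand(8, 8, 8)
assert np.allclose(pencil_fft3d(a), np.fft.fftn(a))
```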
{"title":"Wafer-Scale Fast Fourier Transforms","authors":"Marcelo Orenes-Vera, I. Sharapov, R. Schreiber, M. Jacquelin, Philippe Vandermersch, Sharan Chetlur","doi":"10.1145/3577193.3593708","DOIUrl":"https://doi.org/10.1145/3577193.3593708","url":null,"abstract":"We have implemented fast Fourier transforms for one, two, and three-dimensional arrays on the Cerebras CS-2, a system whose memory and processing elements reside on a single silicon wafer. The wafer-scale engine (WSE) encompasses a two-dimensional mesh of roughly 850,000 processing elements (PEs) with fast local memory and equally fast nearest-neighbor interconnections. Our wafer-scale FFT (wsFFT) parallelizes a n3 problem with up to n2 PEs. At this point, a PE processes only a single vector of the 3D domain (known as a pencil) per superstep, where each of the three supersteps performs FFT along one of the three axes of the input array. Between supersteps, wsFFT redistributes (transposes) the data to bring all elements of each one-dimensional pencil being transformed into the memory of a single PE. Each redistribution causes an all-to-all communication along one of the mesh dimensions. Given the level of parallelism, the size of the messages transmitted between pairs of PEs can be as small as a single word. In theory, a mesh is not ideal for all-to-all communication due to its limited bisection bandwidth. However, the mesh interconnecting PEs on the WSE lies entirely on-wafer and achieves nearly peak bandwidth even with tiny messages. We analyze in detail computation and communication time, as well as the weak and strong scaling, using both FP16 and FP32 precision. With 32-bit arithmetic on the CS-2, we achieve 959 microseconds for 3D FFT of a 5123 complex input array using a 512 × 512 subgrid of the on-wafer PEs. This is the largest ever parallelization for this problem size and the first implementation that breaks the millisecond barrier.","PeriodicalId":424155,"journal":{"name":"Proceedings of the 37th International Conference on Supercomputing","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132223124","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
PERKS: a Locality-Optimized Execution Model for Iterative Memory-bound GPU Applications
Lingqi Zhang, M. Wahib, Peng Chen, Jintao Meng, Xiao Wang, Toshio Endo, S. Matsuoka
DOI: 10.1145/3577193.3593705
Iterative memory-bound solvers commonly occur in HPC codes. Typical GPU implementations have a loop on the host side that invokes the GPU kernel as many times as there are time/algorithm steps, and the termination of each kernel implicitly acts as the barrier required after advancing the solution at every time step. We propose an execution model for running memory-bound iterative GPU kernels: PERsistent KernelS (PERKS). In this model, the time loop is moved inside a persistent kernel, and device-wide barriers are used for synchronization. We then reduce the traffic to device memory by caching a subset of the output of each time step in the otherwise unused registers and shared memory. PERKS can be generalized to any iterative solver, as it is largely independent of the solver's implementation. We explain the design principles of PERKS and demonstrate its effectiveness for a wide range of iterative 2D/3D stencil benchmarks (geomean speedup of 2.12× for 2D stencils and 1.24× for 3D stencils over state-of-the-art libraries) and for a Krylov subspace conjugate gradient solver (geomean speedup of 4.86× on smaller SpMV datasets from SuiteSparse and 1.43× on larger SpMV datasets over a state-of-the-art library). All PERKS-based implementations are available at: https://github.com/neozhang307/PERKS.
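The restructuring can be modeled on the host. In the sketch below (our Python model, with threads standing in for thread blocks and threading.Barrier for the device-wide barrier), the time loop lives inside the long-lived "kernel" instead of around repeated kernel launches, so per-step launch overhead disappears and per-worker state can stay resident.

```python
import threading
import numpy as np

def run_perks(grid, steps, n_workers=4):
    """Jacobi-style 1D smoothing with the time loop inside persistent workers."""
    barrier = threading.Barrier(n_workers)
    chunks = np.array_split(np.arange(1, grid.shape[0] - 1), n_workers)
    bufs = [grid.copy(), grid.copy()]              # double buffer

    def persistent_kernel(rows):
        for t in range(steps):                     # time loop lives in the kernel
            src, dst = bufs[t % 2], bufs[(t + 1) % 2]
            dst[rows] = 0.5 * src[rows] + 0.25 * (src[rows - 1] + src[rows + 1])
            barrier.wait()                         # stands in for grid-wide sync

    threads = [threading.Thread(target=persistent_kernel, args=(c,)) for c in chunks]
    for th in threads: th.start()
    for th in threads: th.join()
    return bufs[steps % 2]

print(run_perks(np.random.rand(32), steps=10).round(3))
```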
{"title":"PERKS: a Locality-Optimized Execution Model for Iterative Memory-bound GPU Applications","authors":"Lingqi Zhang, M. Wahib, Peng Chen, Jintao Meng, Xiao Wang, Toshio Endo, S. Matsuoka","doi":"10.1145/3577193.3593705","DOIUrl":"https://doi.org/10.1145/3577193.3593705","url":null,"abstract":"Iterative memory-bound solvers commonly occur in HPC codes. Typical GPU implementations have a loop on the host side that invokes the GPU kernel as much as time/algorithm steps there are. The termination of each kernel implicitly acts the barrier required after advancing the solution every time step. We propose an execution model for running memory-bound iterative GPU kernels: PERsistent KernelS (PERKS). In this model, the time loop is moved inside persistent kernel, and device-wide barriers are used for synchronization. We then reduce the traffic to device memory by caching subset of the output in each time step in the unused registers and shared memory. PERKS can be generalized to any iterative solver: they largely independent of the solver's implementation. We explain the design principle of PERKS and demonstrate effectiveness of PERKS for a wide range of iterative 2D/3D stencil benchmarks (geomean speedup of 2.12x for 2D stencils and 1.24x for 3D stencils over state-of-art libraries), and a Krylov subspace conjugate gradient solver (geomean speedup of 4.86x in smaller SpMV datasets from SuiteSparse and 1.43x in larger SpMV datasets over a state-of-art library). All PERKS-based implementations available at: https://github.com/neozhang307/PERKS.","PeriodicalId":424155,"journal":{"name":"Proceedings of the 37th International Conference on Supercomputing","volume":"67 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133181574","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proceedings of the 37th International Conference on Supercomputing","authors":"","doi":"10.1145/3577193","DOIUrl":"https://doi.org/10.1145/3577193","url":null,"abstract":"","PeriodicalId":424155,"journal":{"name":"Proceedings of the 37th International Conference on Supercomputing","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131777787","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}