2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS): Latest Papers

Power Constrained Autotuning using Graph Neural Networks

2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Pub Date : 2023-02-22 DOI: 10.1109/IPDPS54959.2023.00060

Akashnil Dutta, JeeWhan Choi, A. Jannesari

Recent advances in multi- and many-core processors have led to significant improvements in the performance of scientific computing applications. However, the addition of a large number of complex cores has also increased the overall power consumption, and power has become a first-order design constraint in modern processors. While we can limit power consumption by simply applying software-based power constraints, applying them blindly will lead to non-trivial performance degradation. To address the challenge of improving the performance, power, and energy efficiency of scientific applications on modern multi-core processors, we propose a novel Graph Neural Network-based auto-tuning approach that (i) optimizes runtime performance at pre-defined power constraints, and (ii) simultaneously optimizes for runtime performance and energy efficiency by minimizing the energy-delay product. The key idea behind this approach lies in modeling parallel code regions as flow-aware code graphs to capture both semantic and structural code features. We demonstrate the efficacy of our approach by conducting an extensive evaluation on 30 benchmarks and proxy-/mini-applications with 68 OpenMP code regions. Our approach identifies OpenMP configurations at different power constraints that yield a geometric mean performance improvement of more than 25% and 13% over the default OpenMP configuration on a 32-core Skylake and a 16-core Haswell processor, respectively. In addition, when we optimize for the energy-delay product, the OpenMP configurations selected by our auto-tuner demonstrate both performance improvement of 21% and 11% and energy reduction of 29% and 18% over the default OpenMP configuration at Thermal Design Power for the same Skylake and Haswell processors, respectively.
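The second objective can be made concrete with a small sketch. The energy-delay product (EDP) of a run is its energy multiplied by its runtime, so minimizing it favors configurations that are both fast and frugal. The code below simply picks the lowest-EDP entry from a table of measurements. The configuration labels and numbers are invented for illustration and are not results from the paper, and the exhaustive search is a stand-in assumption for the paper's GNN-based prediction.

```python
from typing import NamedTuple


class Measurement(NamedTuple):
    config: str       # hypothetical OpenMP configuration label
    runtime_s: float  # measured wall-clock time in seconds
    energy_j: float   # measured energy in joules


def energy_delay_product(m: Measurement) -> float:
    """EDP = energy * runtime; lower is better."""
    return m.energy_j * m.runtime_s


def best_by_edp(measurements):
    """Return the measurement with the smallest energy-delay product."""
    return min(measurements, key=energy_delay_product)


# Illustrative numbers only (not taken from the paper).
samples = [
    Measurement("threads=32,schedule=static", 10.0, 1500.0),
    Measurement("threads=16,schedule=dynamic", 12.0, 1100.0),
    Measurement("threads=8,schedule=guided", 20.0, 900.0),
]
best = best_by_edp(samples)
```

In this made-up table the fastest configuration is not the EDP winner, because its extra energy outweighs the time it saves. That is the kind of trade-off an EDP objective is meant to expose.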

Citations: 1

SCONNA: A Stochastic Computing Based Optical Accelerator for Ultra-Fast, Energy-Efficient Inference of Integer-Quantized CNNs

2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Pub Date : 2023-02-14 DOI: 10.1109/IPDPS54959.2023.00061

Sairam Sri Vatsavai, Venkata Sai Praneeth Karempudi, Ishan G. Thakkar, S. A. Salehi, J. Hastings

Convolutional Neural Networks (CNNs) are used extensively for artificial intelligence applications due to their record-breaking accuracy. For efficient and swift hardware-based acceleration, CNNs are typically quantized to have integer input/weight parameters. The acceleration of a CNN inference task uses convolution operations that are typically transformed into vector-dot-product (VDP) operations. Several hardware architectures based on photonic microring resonators (MRRs) have been proposed to accelerate integer-quantized CNNs with remarkably higher throughput and energy efficiency than their electronic counterparts. However, the existing photonic MRR-based analog accelerators exhibit a very strong trade-off between the achievable input/weight precision and VDP operation size, which severely restricts their achievable VDP operation size for quantized input/weight precisions of 4 bits and higher. The restricted VDP operation size ultimately suppresses computing throughput and severely diminishes the achievable performance benefits. To address this shortcoming, we present, for the first time, a merger of stochastic computing and MRR-based CNN accelerators. To leverage the innate precision flexibility of stochastic computing, we invent an MRR-based optical stochastic multiplier (OSM). We employ multiple OSMs in a cascaded manner, using dense wavelength division multiplexing, to forge a novel Stochastic Computing based Optical Neural Network Accelerator (SCONNA). SCONNA achieves high throughput and energy efficiency when accelerating inference of high-precision quantized CNNs. Our evaluation for the inference of four modern CNNs at 8-bit input/weight precision indicates that SCONNA provides improvements of up to 66.5×, 90× and 91× in frames-per-second (FPS), FPS/W and FPS/W/mm², respectively, on average over two photonic MRR-based analog CNN accelerators from prior work, with a Top-1 accuracy drop of only up to 0.4% for large CNNs and up to 1.5% for small CNNs. We developed a transaction-level, event-driven Python-based simulator for the evaluation of SCONNA and other accelerators (https://github.com/uky-UCAT/SC_ONN_SIM.git).
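For readers unfamiliar with stochastic computing, the following sketch shows the general idea behind a stochastic multiplier. It assumes the common unipolar encoding, where a value in [0, 1] is represented by a random bitstream whose fraction of ones equals the value, so that a bitwise AND of two independent streams approximates the product. This is a software illustration of the textbook principle, not the paper's optical OSM design; the function names and stream length are hypothetical.

```python
import random


def to_bitstream(value, length, rng):
    """Encode a value in [0, 1] as `length` bits, each 1 with probability `value`."""
    return [1 if rng.random() < value else 0 for _ in range(length)]


def stochastic_multiply(a, b, length=4096, seed=0):
    """Approximate a * b by ANDing two independent unipolar bitstreams."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    stream_a = to_bitstream(a, length, rng)
    stream_b = to_bitstream(b, length, rng)
    ones = sum(x & y for x, y in zip(stream_a, stream_b))
    return ones / length  # fraction of ones estimates the product


estimate = stochastic_multiply(0.5, 0.5)  # close to 0.25, not exact
```

Longer streams reduce the estimation error at the cost of more bit operations, so precision can be traded against latency without changing the multiplier itself.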

Citations: 2

Exploiting Sparsity in Pruned Neural Networks to Optimize Large Model Training

2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Pub Date : 2023-02-10 DOI: 10.1109/IPDPS54959.2023.00033

Siddharth Singh, A. Bhatele

Parallel training of neural networks at scale is challenging due to significant overheads arising from communication. Recently, deep learning researchers have developed a variety of pruning algorithms that are capable of pruning (i.e., setting to zero) 80-90% of the parameters in a neural network to yield sparse subnetworks that equal the accuracy of the unpruned parent network. In this work, we propose a novel approach that exploits these sparse subnetworks to optimize the memory utilization and communication in two popular algorithms for parallel deep learning, namely data parallelism and inter-layer parallelism. We integrate our approach into AxoNN, a highly scalable framework for parallel deep learning that relies on data and inter-layer parallelism, and demonstrate the reduction in communication time and memory utilization. On 512 NVIDIA V100 GPUs, our optimizations reduce the memory consumption of a 2.7-billion-parameter model by 74%, and the total communication time by 40%, thus providing an overall speedup of 34% over AxoNN, 32% over DeepSpeed-3D and 46% over Sputnik, a sparse matrix computation baseline.
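A toy sketch can show why pruning helps memory and communication. It assumes simple magnitude pruning and an (index, value) representation of the surviving parameters. The abstract does not describe the paper's actual pruning algorithm or storage layout, so every name below is hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    k = int(len(weights) * sparsity)  # number of weights to drop
    by_magnitude = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(by_magnitude[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def to_sparse(weights):
    """Keep only (index, value) pairs for the nonzero entries."""
    return [(i, w) for i, w in enumerate(weights) if w != 0.0]


def to_dense(pairs, size):
    """Rebuild the dense vector from its (index, value) pairs."""
    dense = [0.0] * size
    for i, w in pairs:
        dense[i] = w
    return dense


weights = [0.9, -0.05, 0.02, -0.7, 0.01, 0.3, -0.04, 0.06, 0.03, -0.08]
pruned = magnitude_prune(weights, 0.8)  # prune 80% of the parameters
compact = to_sparse(pruned)             # only 2 of 10 entries remain
```

Only the compact pairs need to be stored or exchanged between workers. The dense layout can be rebuilt with `to_dense(compact, len(weights))` when a dense kernel is required.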

Citations: 3

Feature-based SpMV Performance Analysis on Contemporary Devices

2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Pub Date : 2023-02-08 DOI: 10.1109/IPDPS54959.2023.00072

Panagiotis Mpakos, D. Galanopoulos, Petros Anastasiadis, Nikela Papadopoulou, N. Koziris, G. Goumas

The SpMV kernel is characterized by high performance variation per input matrix and computing platform. While GPUs were considered the state of the art for SpMV, the emergence of advanced multicore CPUs and low-power FPGA accelerators means we need to revisit its performance and energy efficiency. This paper provides a high-level SpMV performance analysis based on structural features of matrices related to common bottlenecks of memory-bandwidth intensity, low ILP, load imbalance and memory latency overheads. To this end, we create a wide artificial matrix dataset that spans these features and study the performance of different storage formats on nine modern HPC platforms: five CPUs, three GPUs and an FPGA. After validating our proposed methodology using real-world matrices, we analyze our extensive experimental results and draw key insights on the competitiveness of different target architectures for SpMV and the impact of each feature/bottleneck on its performance.
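As background, the sketch below shows sparse matrix-vector multiplication (SpMV) in the compressed sparse row (CSR) format, together with one simple structural feature, the number of nonzeros per row. CSR is used here only because it is a widely known format, and the abstract does not list which storage formats or features the paper evaluates, so treat both choices as assumptions.

```python
def dense_to_csr(matrix):
    """Convert a dense row-major matrix into CSR arrays."""
    values, col_idx, row_ptr = [], [], [0]
    for row in matrix:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))  # end of this row in `values`
    return values, col_idx, row_ptr


def spmv_csr(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a CSR matrix A."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]  # irregular access into x
        y.append(acc)
    return y


def nnz_per_row(row_ptr):
    """Row lengths; a wide spread hints at load imbalance across rows."""
    return [row_ptr[r + 1] - row_ptr[r] for r in range(len(row_ptr) - 1)]


A = [[4, 0, 0, 1],
     [0, 0, 0, 0],
     [0, 2, 3, 0],
     [5, 0, 0, 6]]
values, col_idx, row_ptr = dense_to_csr(A)
y = spmv_csr(values, col_idx, row_ptr, [1, 2, 3, 4])
```

The inner loop performs one multiply-add per loaded value and column index, and its reads of `x` follow the sparsity pattern. That is why memory bandwidth, memory latency, and row-length imbalance tend to dominate SpMV performance.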

Citations: 0

Accelerating CNN inference on long vector architectures via co-design

2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Pub Date : 2022-12-22 DOI: 10.1109/IPDPS54959.2023.00024

Sonia Gupta, Nikela Papadopoulou, M. Pericàs

CPU-based inference can be deployed as an alternative to off-chip accelerators. In this context, emerging vector architectures are a promising option, owing to their high efficiency. Yet the large design space of convolutional algorithms and hardware implementations makes the selection of design options challenging. In this paper, we present our ongoing research into co-designing future vector architectures for CPU-based Convolutional Neural Networks (CNN) inference focusing on the im2col+GEMM and Winograd kernels. Using the Gem5 simulator we explore the impact of several hardware microarchitectural features including (i) vector lanes, (ii) vector lengths, (iii) cache sizes, and (iv) options for integrating the vector unit into the CPU pipeline. In the context of im2col+GEMM, we study the impact of several BLIS-like algorithmic optimizations such as (1) utilization of vector registers, (2) loop unrolling, (3) loop reorder, (4) manual vectorization, (5) prefetching, and (6) packing of matrices, on the RISC-V Vector Extension and ARM-SVE ISAs. We use the YOLOv3 and VGG16 network models for our evaluation. Our co-design study shows that BLIS-like optimizations are not beneficial to all types of vector microarchitectures. We additionally demonstrate that longer vector lengths (of at least 8192 bits) and larger caches (of 256MB) can boost performance by 5×, with our optimized CNN kernels, compared to a vector length of 512-bit and 1MB of L2 cache. In the context of Winograd, we present our novel approach of inter-tile parallelization across the input/output channels by using 8×8 tiles per channel to vectorize the algorithm on vector length agnostic (VLA) architectures. Our method exploits longer vector lengths and offers high memory reuse, resulting in performance improvement of up to 2.4× for non-strided convolutional layers with 3×3 kernel size, compared to our optimized im2col+GEMM approach on the Fujitsu A64FX processor. 
Our co-design study furthermore reveals that Winograd requires smaller cache sizes (up to 64MB) compared to im2col+GEMM.
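The im2col+GEMM formulation mentioned above can be illustrated with a minimal sketch. It assumes a single-channel image, a single filter, stride 1 and no padding, so the matrix product degenerates to one dot product per patch. Real implementations batch many channels and filters into one large GEMM, and none of the paper's vector-specific optimizations are reproduced here.

```python
def im2col(image, kh, kw):
    """Unroll every kh x kw patch of a 2D image into one row of a matrix."""
    h, w = len(image), len(image[0])
    rows = []
    for i in range(h - kh + 1):
        for j in range(w - kw + 1):
            rows.append([image[i + di][j + dj]
                         for di in range(kh) for dj in range(kw)])
    return rows


def conv2d_via_gemm(image, kernel):
    """Convolution (CNN-style cross-correlation) as patches @ flattened kernel."""
    kh, kw = len(kernel), len(kernel[0])
    flat_kernel = [v for row in kernel for v in row]
    patches = im2col(image, kh, kw)
    out_w = len(image[0]) - kw + 1
    flat_out = [sum(p * k for p, k in zip(patch, flat_kernel))
                for patch in patches]
    # Reshape the flat result back into a 2D output feature map.
    return [flat_out[r:r + out_w] for r in range(0, len(flat_out), out_w)]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]
out = conv2d_via_gemm(image, kernel)
```

The patch matrix duplicates overlapping pixels, which costs memory but turns the convolution into dense, regular arithmetic of the kind that long vector units and cache blocking handle well.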

Citations: 1

ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs

2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Pub Date : 2022-10-06 DOI: 10.1109/IPDPS54959.2023.00042

Yujia Zhai, Chengquan Jiang, Leyuan Wang, Xiaoying Jia, Shang Zhang, Zizhong Chen, Xin Liu, Yibo Zhu

Transformers have become keystone models in natural language processing over the past decade. They have achieved great popularity in deep learning applications, but the increasing sizes of the parameter spaces required by transformer models generate a commensurate need to accelerate performance. Natural language processing problems are also routinely faced with variable-length sequences, as word counts commonly vary among sentences. Existing deep learning frameworks pad variable-length sequences to a maximal length, which adds significant memory and computational overhead. In this paper, we present ByteTransformer, a high-performance transformer boosted for variable-length inputs. We propose a padding-free algorithm that liberates the entire transformer from redundant computations on zero padded tokens. In addition to algorithmic-level optimization, we provide architecture-aware optimizations for transformer functional modules, especially the performance-critical algorithm Multi-Head Attention (MHA). Experimental results on an NVIDIA A100 GPU with variable-length sequence inputs validate that our fused MHA outperforms PyTorch by 6.13x. The end-to-end performance of ByteTransformer for a forward BERT transformer surpasses state-of-the-art transformer frameworks, such as PyTorch JIT, TensorFlow XLA, Tencent TurboTransformer, Microsoft DeepSpeed-Inference and NVIDIA FasterTransformer, by 87%, 131%, 138%, 74% and 55%, respectively. We also demonstrate the general applicability of our optimization methods to other BERT-like models, including ALBERT, DistilBERT, and DeBERTa.
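The cost of padding, and one way to avoid it, can be sketched in a few lines. The code assumes a packed layout in which all sequences are concatenated and an offsets array records where each one starts. This is a common technique for variable-length batches, but the abstract does not spell out ByteTransformer's actual mechanism, so the layout and names below are assumptions.

```python
def pack(sequences):
    """Concatenate variable-length sequences and record their start offsets."""
    packed, offsets = [], [0]
    for seq in sequences:
        packed.extend(seq)
        offsets.append(len(packed))  # running prefix sum of lengths
    return packed, offsets


def unpack(packed, offsets):
    """Recover the original sequences from the packed layout."""
    return [packed[offsets[i]:offsets[i + 1]]
            for i in range(len(offsets) - 1)]


def padded_token_count(sequences):
    """Tokens processed when every sequence is padded to the longest one."""
    return len(sequences) * max(len(s) for s in sequences)


batch = [[11, 12, 13, 14, 15], [21, 22], [31, 32, 33]]
packed, offsets = pack(batch)
wasted = padded_token_count(batch) - len(packed)  # padding tokens avoided
```

Token-wise operations can run directly on `packed`, while operations that need sequence boundaries, such as attention, can use `offsets` to find them.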

Citations: 8

Stochastic Neuromorphic Circuits for Solving MAXCUT

2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Pub Date : 2022-10-05 DOI: 10.1109/IPDPS54959.2023.00083

Bradley H. Theilman, Yipu Wang, Ojas D. Parekh, William M. Severa, J. D. Smith, J. Aimone

Finding the maximum cut of a graph (MAXCUT) is a classic optimization problem that has motivated parallel algorithm development. While approximate algorithms to MAXCUT offer attractive theoretical guarantees and demonstrate compelling empirical performance, such approximation approaches can shift the dominant computational cost to the stochastic sampling operations. Neuromorphic computing, which uses the organizing principles of the nervous system to inspire new parallel computing architectures, offers a possible solution. One ubiquitous feature of natural brains is stochasticity: the individual elements of biological neural networks possess an intrinsic randomness that serves as a resource enabling their unique computational capacities. By designing circuits and algorithms that make use of randomness similarly to natural brains, we hypothesize that the intrinsic randomness in microelectronics devices could be turned into a valuable component of a neuromorphic architecture enabling more efficient computations. Here, we present neuromorphic circuits that transform the stochastic behavior of a pool of random devices into useful correlations that drive stochastic solutions to MAXCUT. We show that these circuits perform favorably in comparison to software solvers and argue that this neuromorphic hardware implementation provides a path for scaling advantages. This work demonstrates the utility of combining neuromorphic principles with intrinsic randomness as a computational resource for new computational architectures.
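To make the problem statement concrete, the sketch below evaluates cuts and searches for a large one by drawing random node assignments. It uses a software pseudo-random generator with uniform, uncorrelated samples as a stand-in assumption. The paper's circuits instead derive useful correlations from hardware random devices, which this baseline does not attempt to model.

```python
import random


def cut_value(edges, side):
    """Number of edges whose endpoints fall on different sides of the cut."""
    return sum(1 for u, v in edges if side[u] != side[v])


def random_cut_search(num_nodes, edges, samples=200, seed=0):
    """Keep the best cut found among uniformly random assignments."""
    rng = random.Random(seed)  # seeded for reproducibility
    best_side, best_value = None, -1
    for _ in range(samples):
        side = [rng.randint(0, 1) for _ in range(num_nodes)]
        value = cut_value(edges, side)
        if value > best_value:
            best_side, best_value = side, value
    return best_side, best_value


# A 4-cycle is bipartite, so its maximum cut contains all 4 edges.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
side, value = random_cut_search(4, edges)
```

A uniformly random assignment cuts each edge with probability 1/2, so it cuts half the edges in expectation. Approximation algorithms with stronger guarantees replace the uniform samples with correlated ones, and generating those samples is where the sampling cost mentioned in the abstract arises.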

Citations: 4

Scheduling with Many Shared Resources

2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Pub Date : 2022-10-04 DOI: 10.1109/IPDPS54959.2023.00049

Max A. Deppert, K. Jansen, M. Maack, Simon Pukrop, M. Rau

Consider the many shared resources scheduling problem, where jobs have to be scheduled on identical parallel machines with the goal of minimizing the makespan. However, each job needs exactly one additional shared resource in order to be executed and hence, while being processed, prevents the execution of other jobs that need the same resource. Previously, an approximation ratio of asymptotically 2 was the best known result for this problem. Furthermore, a 6/5-approximation was known for the case with only two machines, as well as a PTAS for the case with a constant number of machines. We present a simple and fast 5/3-approximation and a much more involved but still reasonable 1.5-approximation. Furthermore, we provide a PTAS for the case with only a constant number of machines, which is arguably simpler and faster than the previously known one, as well as a PTAS with resource augmentation for the general case. The approximation schemes make use of the N-fold integer programming machinery, which has found more and more applications in the field of scheduling recently. It is plausible that the latter results can be improved and extended to more general cases. Lastly, we give an inapproximability result for the natural problem extension where each job may need up to a constant number of different resources, namely 3, ruling out better than 5/4 approximations for that case.
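
The problem setting can be illustrated with a small sketch. The code below is not one of the paper's algorithms and carries no claimed approximation guarantee; it is a hypothetical baseline that keeps all jobs sharing a resource on one machine (so they can never overlap in time) and places the resource classes longest-first on the least-loaded machine. It also computes two simple lower bounds on the makespan: the average machine load and the largest total load of any single resource class. All names are assumptions.

```python
import heapq
from collections import defaultdict

def class_loads(jobs):
    """Total processing time per resource; jobs is a list of (time, resource)."""
    loads = defaultdict(float)
    for p, r in jobs:
        loads[r] += p
    return loads

def class_lpt_schedule(jobs, n_machines):
    """Assign each resource class, largest first, to the least-loaded machine."""
    machines = [(0.0, i) for i in range(n_machines)]  # min-heap of (load, index)
    heapq.heapify(machines)
    assignment = {}
    for r, load in sorted(class_loads(jobs).items(), key=lambda kv: -kv[1]):
        current, i = heapq.heappop(machines)
        assignment[r] = i  # the whole class runs back-to-back on machine i
        heapq.heappush(machines, (current + load, i))
    makespan = max(load for load, _ in machines)
    return makespan, assignment

def lower_bound(jobs, n_machines):
    """No schedule can beat the average load or the largest resource class."""
    total = sum(p for p, _ in jobs)
    return max(total / n_machines, max(class_loads(jobs).values()))

# Hypothetical instance: five jobs, three resources, two machines.
jobs = [(3, "a"), (2, "a"), (4, "b"), (1, "c"), (2, "c")]
makespan, assignment = class_lpt_schedule(jobs, 2)
bound = lower_bound(jobs, 2)
```

On this instance the heuristic produces a makespan of 7 against a lower bound of 6. Splitting a resource class across machines, as stronger algorithms may do, requires explicit reasoning about start times, which this sketch deliberately avoids.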

Citations: 0

qTask: Task-parallel Quantum Circuit Simulation with Incrementality

2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Pub Date : 2022-10-03 DOI: 10.1109/IPDPS54959.2023.00080

Tsung-Wei Huang

Incremental quantum circuit simulation has emerged as an important tool for simulation-driven quantum applications, such as circuit synthesis, verification, and analysis. When a small portion of the circuit is modified, the simulator must incrementally update state amplitudes for reasonable turnaround time and productivity. However, this type of incrementality has been largely ignored by existing research. To fill this gap, we introduce a new incremental quantum circuit simulator called qTask. qTask leverages a task-parallel decomposition strategy to explore both inter- and intra-gate operation parallelism from partitioned data blocks. Our partitioning strategy effectively narrows incremental updates down to a small set of partitions affected by circuit modifiers. We have demonstrated the promising performance of qTask on QASMBench benchmarks. Compared to two state-of-the-art simulators, Qulacs and Qiskit, qTask is respectively 1.46× and 1.71× faster for full simulation and 5.77× and 9.76× faster for incremental simulation.
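
The idea of incremental simulation can be sketched in a few lines, under strong simplifying assumptions. The code below is not qTask's partition-based scheme: it only caches the full state vector before every gate, so that replacing gate k re-applies gates k onward instead of the whole circuit. The class `IncrementalSimulator`, its methods, and the restriction to single-qubit gates are all hypothetical.

```python
import math

def apply_gate(state, gate, qubit):
    """Apply a 2x2 gate to one qubit of a state vector (list of complex)."""
    out = list(state)
    stride = 1 << qubit
    for i in range(len(state)):
        if i & stride == 0:  # i has the target bit clear, j has it set
            j = i | stride
            a, b = state[i], state[j]
            out[i] = gate[0][0] * a + gate[0][1] * b
            out[j] = gate[1][0] * a + gate[1][1] * b
    return out

class IncrementalSimulator:
    """Keeps one checkpoint per gate; memory grows with circuit depth."""

    def __init__(self, n_qubits):
        initial = [0j] * (1 << n_qubits)
        initial[0] = 1 + 0j
        self.gates = []               # list of (gate, qubit)
        self.checkpoints = [initial]  # checkpoints[k] is the state before gate k

    def append(self, gate, qubit):
        self.gates.append((gate, qubit))
        self.checkpoints.append(apply_gate(self.checkpoints[-1], gate, qubit))

    def replace(self, k, gate, qubit):
        """Swap gate k, recompute only later states, return gates re-applied."""
        self.gates[k] = (gate, qubit)
        del self.checkpoints[k + 1:]
        for g, q in self.gates[k:]:
            self.checkpoints.append(apply_gate(self.checkpoints[-1], g, q))
        return len(self.gates) - k

    @property
    def state(self):
        return self.checkpoints[-1]

h = 1 / math.sqrt(2)
H = [[h, h], [h, -h]]  # Hadamard gate
X = [[0, 1], [1, 0]]   # Pauli-X gate

sim = IncrementalSimulator(2)
sim.append(H, 0)
sim.append(X, 1)
sim.append(H, 0)
full_state = sim.state
reapplied = sim.replace(1, H, 1)  # modify the middle gate only
```

Here the modification re-applies two of the three gates. A partitioned design such as the one the abstract describes aims to go further by also limiting which blocks of the state vector are touched.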

Citations: 0

Scalable adaptive algorithms for next-generation multiphase flow simulations

2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Pub Date : 2022-09-25 DOI: 10.1109/IPDPS54959.2023.00065

K. Saurabh, Masado Ishii, Makrand A. Khanwale, H. Sundar, B. Ganapathysubramanian

High-fidelity flow simulations are indispensable when analyzing systems exhibiting multiphase flow phenomena. The accuracy of multiphase flow simulations is strongly contingent upon the finest mesh resolution used to represent the fluid-fluid interfaces. However, the increased resolution comes at a higher computational cost. In this work, we propose algorithmic advances that aim to reduce the computational cost without compromising on the physics by selectively detecting key regions of interest (droplets/filaments) that require significantly higher resolution. The framework uses adaptive octree-based meshing integrated with PETSc's linear algebra solvers. We demonstrate scaling of the framework up to 114,688 processes on TACC's Frontera. Finally, we deploy the framework to perform one of the most highly resolved simulations of primary jet atomization. This simulation, equivalent to 35 trillion grid points on a uniform grid, is 64× larger than current state-of-the-art simulations and provides unprecedented insights into an important flow physics problem with a diverse array of engineering applications.
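
The saving from adaptive refinement can be illustrated with a toy octree. The sketch below is not the paper's framework: it has no solver coupling, no parallelism, and no 2:1 balancing between neighboring cells. It simply subdivides a cube wherever a cell might intersect a spherical interface, which stands in for a droplet surface. The function names and the sphere test are assumptions.

```python
def refine(center, half, depth, max_depth, needs_refinement):
    """Return the leaf cells (center, half_width) of an octree over a cube."""
    if depth == max_depth or not needs_refinement(center, half):
        return [(center, half)]
    leaves = []
    h = half / 2
    for dx in (-h, h):
        for dy in (-h, h):
            for dz in (-h, h):
                child = (center[0] + dx, center[1] + dy, center[2] + dz)
                leaves += refine(child, h, depth + 1, max_depth, needs_refinement)
    return leaves

def near_sphere(radius):
    """Flag cells whose bounding ball reaches the surface of a sphere at the origin."""
    def test(center, half):
        dist = (center[0] ** 2 + center[1] ** 2 + center[2] ** 2) ** 0.5
        reach = half * 3 ** 0.5  # half-diagonal of the cubic cell
        return abs(dist - radius) <= reach
    return test

max_depth = 5
leaves = refine((0.0, 0.0, 0.0), 1.0, 0, max_depth, near_sphere(0.5))
uniform_cells = 8 ** max_depth  # cells needed at the finest level everywhere
total_volume = sum((2 * half) ** 3 for _, half in leaves)
```

The leaves still tile the whole cube, but only cells near the interface reach the finest level, so `len(leaves)` is smaller than `uniform_cells`. This is the sense in which an adaptive mesh can be described as equivalent to a much larger uniform grid.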

Citations: 0