ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing: latest articles, page 10

AUTO-PRUNE: automated DNN pruning and mapping for ReRAM-based accelerator

Pub Date : 2021-06-03 DOI: 10.1145/3447818.3460366

Siling Yang, Weijian Chen, Xuechen Zhang, Shuibing He, Yanlong Yin, Xian-He Sun

Emerging ReRAM-based accelerators support in-memory computation to accelerate deep neural network (DNN) inference. Weight matrix pruning is a widely used technique to reduce the size of DNN models, thereby reducing the resource and energy consumption of ReRAM-based accelerators. However, prior work on weight matrix pruning for ReRAM-based accelerators has three major issues. First, it relies on heuristics or rules from domain experts to prune the weights, leading to suboptimal pruning policies. Second, it mostly focuses on improving the compression ratio and thus may not meet accuracy constraints. Third, it ignores direct feedback from the hardware. In this paper, we introduce an automated DNN pruning and mapping framework named AUTO-PRUNE. It leverages reinforcement learning (RL) to automatically determine the pruning policy under a constraint on accuracy loss. The reward function of the RL agents is designed using the hardware’s direct feedback (i.e., accuracy and compression rate of occupied crossbars). The function directs the search for the pruning ratio of each layer toward a global optimum, considering the characteristics of the individual layers of DNN models. AUTO-PRUNE then maps the pruned weight matrices to crossbars so that only nontrivial elements are stored. Finally, to avoid the dislocation problem, we design a new data path in ReRAM-based accelerators that correctly indexes and feeds inputs to the matrix-vector computation by leveraging the mechanism of operation units. Experimental results show that, compared to the state-of-the-art work, AUTO-PRUNE achieves up to 3.3X compression rate, 3.1X area efficiency, and 3.3X energy efficiency with similar or even higher accuracy.
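
The abstract above says the RL reward is built from accuracy and the compression rate of occupied crossbars. The following is only a minimal sketch of that idea, not the paper's formulation: the 128-by-128 crossbar size, the column-pruning assumption, the 1% accuracy budget, and the function names `occupied_crossbars`, `crossbar_compression_rate`, and `reward` are all hypothetical choices made for illustration.

```python
import math


def occupied_crossbars(rows, cols, xbar=128):
    """Number of xbar-by-xbar crossbars needed to hold a rows-by-cols matrix."""
    return math.ceil(rows / xbar) * math.ceil(cols / xbar)


def crossbar_compression_rate(shape, kept_cols, xbar=128):
    """Ratio of crossbars occupied before pruning to crossbars occupied after.

    Assumption: pruning removes whole columns, and the surviving columns
    are packed densely before they are mapped to crossbars.
    """
    rows, cols = shape
    before = occupied_crossbars(rows, cols, xbar)
    after = occupied_crossbars(rows, max(1, kept_cols), xbar)
    return before / after


def reward(accuracy, baseline_accuracy, compression_rate, max_loss=0.01):
    """Hypothetical reward: penalize policies that exceed the accuracy-loss
    budget, otherwise return the crossbar-level compression rate."""
    if baseline_accuracy - accuracy > max_loss:
        return -1.0
    return compression_rate


rate = crossbar_compression_rate((512, 512), kept_cols=128)
print(rate)                       # 16 crossbars before pruning, 4 after
print(reward(0.915, 0.92, rate))  # accuracy loss within the budget
print(reward(0.90, 0.92, rate))   # accuracy loss exceeds the budget
```

Counting crossbars rather than individual weights is the point of the sketch: keeping 129 columns instead of 128 would occupy a second column of crossbars and halve the rate, which an element-level compression ratio would not reveal.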

Citations: 22

Enabling energy-efficient DNN training on hybrid GPU-FPGA accelerators

Pub Date : 2021-06-03 DOI: 10.1145/3447818.3460371

Xin He, Jiawen Liu, Zhen Xie, Hao Chen, Guoyang Chen, Weifeng Zhang, Dong Li

DNN training consumes orders of magnitude more energy than inference and requires innovative use of accelerators to improve energy efficiency. However, despite having complementary features, GPUs and FPGAs have mostly been used independently for the entire training process, thus neglecting the opportunity to assign individual but distinct operations to the most suitable hardware. In this paper, we take the initiative to explore new opportunities and viable solutions for enabling energy-efficient DNN training on hybrid accelerators. To overcome fundamental challenges, including avoiding training throughput loss, enabling fast design space exploration, and scheduling efficiently, we propose a comprehensive framework, Hype-training, that combines offline characterization, performance modeling, and online scheduling of individual operations. Experimental tests using NVIDIA V100 GPUs and Intel Stratix 10 FPGAs show that Hype-training exploits a mixture of GPUs and FPGAs at a fine granularity to achieve significant energy reduction, by 44.3% on average and up to 59.7%, without any loss in training throughput. Hype-training can also enforce power caps more effectively than state-of-the-art power management mechanisms on GPUs.
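
The idea of assigning individual operations to the most suitable device can be illustrated with a deliberately simplified sketch. It assumes an offline profile of per-operation time and energy on each device and ignores data-transfer costs and overlap, which a real scheduler would have to model. The profile numbers and the names `assign_ops` and `total_energy` are hypothetical and are not taken from Hype-training.

```python
def assign_ops(profile):
    """Pick a device per operation from an offline profile.

    profile maps op name -> {"gpu": (time_ms, energy_j), "fpga": (time_ms, energy_j)}.
    Assumed rule: move an op to the FPGA only if that saves energy and is
    no slower than the GPU, so throughput is not reduced.
    """
    plan = {}
    for op, costs in profile.items():
        gpu_time, gpu_energy = costs["gpu"]
        fpga_time, fpga_energy = costs["fpga"]
        if fpga_time <= gpu_time and fpga_energy < gpu_energy:
            plan[op] = "fpga"
        else:
            plan[op] = "gpu"
    return plan


def total_energy(profile, plan):
    """Sum the profiled energy of each op on its assigned device."""
    return sum(profile[op][device][1] for op, device in plan.items())


# Made-up profile values, for illustration only.
profile = {
    "conv": {"gpu": (2.0, 0.50), "fpga": (3.0, 0.20)},
    "relu": {"gpu": (0.5, 0.10), "fpga": (0.4, 0.03)},
}
plan = assign_ops(profile)
print(plan)
print(total_energy(profile, plan))
```

In this toy profile the convolution stays on the GPU because the FPGA would be slower, while the activation moves to the FPGA because it is both faster and cheaper there.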

Citations: 6

PSSM

Pub Date : 2021-06-03 DOI: 10.1145/3447818.3460374

Shougang Yuan, Yan Solihin, Huiyang Zhou

In this paper, we investigate secure memory architecture for GPUs and point out that conventional CPU secure memory architecture cannot be directly adopted for GPUs. The key reasons include: (1) accessing the security metadata, including encryption counters, message authentication codes (MACs), and integrity trees, requires significant memory bandwidth, which may lead to severe bandwidth competition with normal data accesses and degrade GPU performance; (2) contemporary GPUs use a partitioned memory organization, which results in storage and coherence problems for encryption counters and integrity trees, since different partitions may need to update the same counter or integrity tree blocks; and (3) the existing split-counter block organization is not friendly to sectored caches, which are commonly used in GPUs for bandwidth savings. Based on these observations, we propose partitioned and sectored security metadata (PSSM), which has two components: (a) using the offset addresses (referred to as local addresses) within each partition, instead of the virtual or physical addresses, to generate the metadata so as to solve the counter and integrity tree storage and coherence problem, and (b) reorganizing the security metadata to make it friendly to the sectored cache structure so as to reduce the memory bandwidth consumed by metadata accesses. With these proposed schemes, the performance overhead of secure GPU memory is reduced from 59.22% to 16.84% on average. If only memory encryption is required, the performance overhead is reduced from 29.53% to 5.18%.
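
Component (a), generating metadata from partition-local addresses, can be pictured with a small address-mapping sketch. The interleaving scheme, the 8 partitions, the 256-byte interleave granularity, the 128-byte line size, and the names `to_local` and `counter_slot` are assumptions chosen for illustration; the abstract does not specify them.

```python
def to_local(phys_addr, num_partitions=8, interleave_bytes=256):
    """Split a physical address into (partition id, partition-local address).

    Assumption: consecutive interleave_bytes-sized chunks are assigned to
    partitions round-robin.
    """
    chunk = phys_addr // interleave_bytes
    partition = chunk % num_partitions
    local_chunk = chunk // num_partitions
    local_addr = local_chunk * interleave_bytes + phys_addr % interleave_bytes
    return partition, local_addr


def counter_slot(local_addr, line_bytes=128):
    """Index of the encryption counter for a line, within its own partition."""
    return local_addr // line_bytes


for addr in (0, 256, 2053):
    partition, local = to_local(addr)
    print(addr, partition, local, counter_slot(local))
```

Because each partition indexes its counters by local address, the counters for one partition form a dense range owned by that partition alone, so no two partitions need to update the same counter block.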

Citations: 3

FULL-W2V: fully exploiting data reuse for W2V on GPU-accelerated systems

Pub Date : 2021-06-03 DOI: 10.1145/3447818.3460373

Thomas Randall, Tyler N. Allen, Rong Ge

Word2Vec remains one of the most impactful innovations in the field of Natural Language Processing (NLP); it represents latent grammatical and syntactical information in human text with dense, low-dimensional vectors. Word2Vec has a high computational cost due to the algorithm’s inherent sequentiality, intensive memory accesses, and the large vocabularies it represents. While prior studies have investigated techniques to exploit parallelism and improve memory system performance, they struggle to gain throughput effectively on powerful GPUs. We identify memory data access and latency as the primary bottleneck in prior works on GPUs, which prevents highly optimized kernels from attaining the architecture’s peak performance. We present a novel algorithm, FULL-W2V, which maximally exploits the opportunities for data reuse in the W2V algorithm and leverages GPU architecture and resources to reduce access to low memory levels and improve temporal locality. FULL-W2V reduces accesses to GPU global memory significantly, e.g., by more than 89%, compared to prior state-of-the-art GPU implementations, resulting in significant performance improvement that scales across successive hardware generations. Our prototype implementation achieves a 2.97X speedup when ported from Nvidia Pascal P100 to Volta V100 cards, and outperforms the state-of-the-art by 5.72X on V100 cards with the same embedding quality. In-depth analysis indicates that reducing memory accesses through register and shared memory caching, together with high-throughput shared memory reduction, leads to significantly improved arithmetic intensity. FULL-W2V can potentially benefit many applications in NLP and other domains.

Citations: 0

ThundeRiNG: generating multiple independent random number sequences on FPGAs

Pub Date : 2021-05-20 DOI: 10.1145/3447818.3461664

Hongshi Tan, Xinyu Chen, Yao Chen, Bingsheng He, W. Wong

In this paper, we propose ThundeRiNG, a resource-efficient and high-throughput system for generating multiple independent sequences of random numbers (MISRN) on FPGAs. Generating MISRN can be a time-consuming step in many applications such as numeric computation and approximate computing. Although decades of studies on generating a single sequence of random numbers on FPGAs have achieved very high throughput and high quality of randomness, existing MISRN approaches either suffer from heavy resource consumption or fail to achieve statistical independence among sequences. In contrast, ThundeRiNG resolves the dependence among multiple sequences by using a resource-efficient decorrelator, guaranteeing a high statistical quality of randomness. Moreover, ThundeRiNG develops a novel state-sharing scheme among a massive number of pseudo-random number generator instances on FPGAs. The experimental results show that ThundeRiNG successfully passes the widely used statistical test suite TestU01, consumes only a constant number of DSPs (less than 1% of the FPGA resource capacity) for generating any number of sequences, and achieves a throughput of 655 billion random numbers per second. Compared to the state-of-the-art GPU library, ThundeRiNG demonstrates a 10.62x speedup on MISRN and delivers up to 9.15x performance and 26.63x power efficiency improvement on two applications (pi estimation and Monte Carlo option pricing). This work is open-sourced on GitHub at https://github.com/Xtra-Computing/ThundeRiNG.
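
To make the two ideas in the abstract concrete, state sharing and per-sequence decorrelation, here is a software toy. It is not the ThundeRiNG design and makes no claim about statistical quality: it assumes one 64-bit linear congruential root state (Knuth's MMIX constants) advanced once per step and shared by all streams, a distinct per-stream offset, and a per-stream xorshift32 generator whose output is XORed in as the decorrelator. The class name `SharedStateStreams` and the seeding scheme are hypothetical.

```python
MASK64 = (1 << 64) - 1
MASK32 = (1 << 32) - 1
LCG_A = 6364136223846793005   # multiplier of Knuth's MMIX LCG
LCG_C = 1442695040888963407   # increment of Knuth's MMIX LCG


def xorshift32(x):
    """One step of Marsaglia's 32-bit xorshift generator (x must be nonzero)."""
    x ^= (x << 13) & MASK32
    x ^= x >> 17
    x ^= (x << 5) & MASK32
    return x


class SharedStateStreams:
    """Toy generator: one shared root state, many cheap per-stream outputs."""

    def __init__(self, num_streams, seed=1):
        self.root = seed & MASK64
        # Distinct per-stream offsets (assumed scheme, odd multiples of a constant).
        self.offsets = [((2 * i + 1) * 0x9E3779B97F4A7C15) & MASK64
                        for i in range(num_streams)]
        # Distinct nonzero decorrelator states, one per stream.
        self.decorr = [((i + 1) * 2654435761) & MASK32 for i in range(num_streams)]

    def next(self):
        """Advance the shared root once and return one 32-bit value per stream."""
        self.root = (self.root * LCG_A + LCG_C) & MASK64  # the only multiplication
        values = []
        for i, offset in enumerate(self.offsets):
            self.decorr[i] = xorshift32(self.decorr[i])
            high = ((self.root + offset) & MASK64) >> 32
            values.append(high ^ self.decorr[i])
        return values


streams = SharedStateStreams(num_streams=4, seed=42)
for _ in range(3):
    print(streams.next())
```

The structural point is that the expensive multiplication happens once per step regardless of how many streams are produced, while each additional stream costs only an addition, a few shifts, and XORs. Any real use would need the statistical testing the abstract describes.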

Citations: 4

Performance portable back-projection algorithms on CPUs: agnostic data locality and vectorization optimizations

Pub Date : 2021-04-27 DOI: 10.1145/3447818.3460353

Peng Chen, M. Wahib, Xiao Wang, Shin'ichiro Takizawa, Takahiro Hirofuchi, Hirotaka Ogawa, S. Matsuoka

Computed Tomography (CT) is a key 3D imaging technology that fundamentally relies on the compute-intensive back-projection operation to generate 3D volumes. GPUs are typically used for back-projection in production CT devices. However, with the rise of power-constrained micro-CT devices and the emergence of CPUs comparable in performance to GPUs, back-projection on CPUs could become favorable. Unlike on GPUs, extracting parallelism for back-projection algorithms on CPUs is complex because parallelism and locality are not explicitly defined and controlled by the programmer, as they are when using CUDA, for instance. We propose a collection of novel back-projection algorithms that reduce the arithmetic computation, robustly enable vectorization, enforce a regular memory access pattern, and maximize data locality. We also implement the novel algorithms as efficient back-projection kernels that are performance portable over a wide range of CPUs. Performance evaluation using a variety of CPUs from different vendors and generations demonstrates that our back-projection implementation achieves on average a 5.2 times speedup over the multi-threaded implementation of the most widely used, and optimized, open library. With a state-of-the-art CPU, we reach performance that rivals top-performing GPUs.

Citations: 0

Partitioning sparse deep neural networks for scalable training and inference

Pub Date : 2021-04-23 DOI: 10.1145/3447818.3460372

G. Demirci, H. Ferhatosmanoğlu

State-of-the-art deep neural networks (DNNs) have significant computational and data management requirements. The size of both training data and models continues to increase. Sparsification and pruning methods have been shown to be effective in removing a large fraction of connections in DNNs. The resulting sparse networks present unique challenges to further improving the computational efficiency of training and inference in deep learning. Both the feedforward (inference) and backpropagation steps in the stochastic gradient descent (SGD) algorithm for training sparse DNNs involve consecutive sparse matrix-vector multiplications (SpMVs). We first introduce a distributed-memory parallel SpMV-based solution for the SGD algorithm to improve its scalability. The parallelization approach is based on row-wise partitioning of the weight matrices that represent neuron connections between consecutive layers. We then propose a novel hypergraph model for partitioning weight matrices to reduce the total communication volume and ensure computational load balance among processors. Experiments performed on sparse DNNs demonstrate that the proposed solution is highly efficient and scalable. Utilizing the proposed matrix partitioning scheme further improves the performance of our solution significantly.
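
Row-wise partitioning and the communication volume it induces can be shown on a tiny example. The sketch below does not implement the paper's hypergraph model; it only evaluates, for a given row-to-processor assignment, how many input-vector entries each processor must receive from others, which is the quantity a partitioner would try to reduce. The dictionary-based sparse format and the names `spmv_rows` and `communication_volume` are hypothetical.

```python
def spmv_rows(rows, x, owned_rows):
    """Compute the y entries for the rows owned by one processor.

    rows maps row index -> {column index: weight} (a sparse weight matrix).
    """
    return {r: sum(w * x[c] for c, w in rows[r].items()) for r in owned_rows}


def communication_volume(rows, row_owner, x_owner):
    """Count distinct (processor, x entry) pairs where a processor needs an
    input-vector entry that is stored on a different processor."""
    needed = set()
    for r, cols in rows.items():
        p = row_owner[r]
        for c in cols:
            if x_owner[c] != p:
                needed.add((p, c))
    return len(needed)


# A 4x4 sparse weight matrix between two consecutive layers (made-up values).
rows = {0: {0: 1, 1: 2}, 1: {1: 3}, 2: {2: 4, 3: 5}, 3: {0: 6, 3: 7}}
x = [1, 1, 1, 1]

block = [0, 0, 1, 1]    # rows 0-1 on processor 0, rows 2-3 on processor 1
cyclic = [0, 1, 0, 1]   # rows dealt out round-robin

print(spmv_rows(rows, x, [0, 1]), spmv_rows(rows, x, [2, 3]))
print(communication_volume(rows, block, block))
print(communication_volume(rows, cyclic, cyclic))
```

Both assignments give each processor two rows, yet the block assignment needs one remote vector entry while the cyclic one needs three, which illustrates why the choice of partition matters for communication even when the load is balanced.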

Citations: 7

FT-BLAS: a high performance BLAS implementation with online fault tolerance

ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing

Pub Date : 2021-04-02 DOI: 10.1145/3447818.3460364

Yujia Zhai, Elisabeth Giem, Quan Fan, Kai Zhao, Jinyang Liu, Zizhong Chen

Basic Linear Algebra Subprograms (BLAS) is a core library in scientific computing and machine learning. This paper presents FT-BLAS, a new implementation of BLAS routines that not only tolerates soft errors on the fly, but also provides comparable performance to modern state-of-the-art BLAS libraries on widely-used processors such as Intel Skylake and Cascade Lake. To accommodate the features of BLAS, which contains both memory-bound and computing-bound routines, we propose a hybrid strategy to incorporate fault tolerance into our brand-new BLAS implementation: duplicating computing instructions for memory-bound Level-1 and Level-2 BLAS routines and incorporating an Algorithm-Based Fault Tolerance mechanism for computing-bound Level-3 BLAS routines. Our high performance and low overhead are obtained from delicate assembly-level optimization and a kernel-fusion approach to the computing kernels. Experimental results demonstrate that FT-BLAS offers high reliability and high performance -- faster than Intel MKL, OpenBLAS, and BLIS by up to 3.50%, 22.14% and 21.70%, respectively, for routines spanning all three levels of BLAS we benchmarked, even under hundreds of errors injected per minute.
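The two fault-tolerance strategies named in the abstract can be illustrated with a small pure-Python sketch. This is not FT-BLAS code, which the abstract describes as assembly-level. It assumes the classic checksum-row form of Algorithm-Based Fault Tolerance for matrix multiplication and a plain compute-twice-and-compare check for a Level-1 dot product. The function names and the `inject` parameter are hypothetical and exist only to simulate a soft error.

```python
import math


def matmul(a, b):
    """Dense matrix product on lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def column_sums(m):
    """Sum of each column of a matrix."""
    return [sum(col) for col in zip(*m)]


def abft_matmul(a, b, inject=None):
    """Multiply with an extra checksum row and report columns that fail the check.

    The checksum row of A is its column sums, so the last row of the extended
    product must equal the column sums of C when no error occurred.
    """
    a_ext = a + [column_sums(a)]
    c_ext = matmul(a_ext, b)
    c, checksum = c_ext[:-1], c_ext[-1]
    if inject is not None:  # simulate a soft error in one element of C
        i, j, delta = inject
        c[i][j] += delta
    bad = [j for j, s in enumerate(column_sums(c))
           if not math.isclose(s, checksum[j], rel_tol=1e-9, abs_tol=1e-9)]
    return c, bad


def duplicated_dot(x, y):
    """Level-1 style check: compute the dot product twice and compare."""
    first = sum(p * q for p, q in zip(x, y))
    second = sum(p * q for p, q in zip(x, y))
    if first != second:
        raise RuntimeError("mismatch between duplicated computations")
    return first


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
clean, clean_bad = abft_matmul(A, B)
_, faulty_bad = abft_matmul(A, B, inject=(0, 1, 1.0))
```

The checksum approach costs one extra row of arithmetic for the whole product, which is why it suits compute-bound routines, whereas duplication doubles the arithmetic but adds no memory traffic, which matters less when a routine is memory-bound.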

Citations: 6

ClickTrain: efficient and accurate end-to-end deep learning training via fine-grained architecture-preserving pruning

ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing

Pub Date : 2020-11-20 DOI: 10.1145/3447818.3459988

Chengming Zhang, Geng Yuan, Wei Niu, Jiannan Tian, Sian Jin, Donglin Zhuang, Zhe Jiang, Yanzhi Wang, Bin Ren, S. Song, Dingwen Tao

Convolutional neural networks (CNNs) are becoming increasingly deep, wide, and non-linear because of the growing demand for prediction accuracy and analysis quality. Wide and deep CNNs, however, require a large amount of computing resources and processing time. Many previous works have studied model pruning to improve inference performance, but little work has been done on effectively reducing training cost. In this paper, we propose ClickTrain: an efficient and accurate end-to-end training and pruning framework for CNNs. Unlike existing pruning-during-training work, ClickTrain provides higher model accuracy and compression ratio via fine-grained architecture-preserving pruning. By leveraging pattern-based pruning with our proposed novel accurate weight importance estimation, dynamic pattern generation and selection, and compiler-assisted computation optimizations, ClickTrain generates highly accurate and fast pruned CNN models for direct deployment without any time overhead compared with the baseline training. ClickTrain also reduces the end-to-end time cost of the state-of-the-art pruning-after-training method by up to 2.3x with comparable accuracy and compression ratio. Moreover, compared with the state-of-the-art pruning-during-training approach, ClickTrain provides significant improvements in both accuracy and compression ratio on the tested CNN models and datasets, under similar limited training time.
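As a rough illustration of pattern-based pruning in general, the sketch below keeps, for each 3x3 kernel, only the weights covered by the best-scoring mask from a small candidate set. It is not ClickTrain's method: the four candidate patterns are invented for the example, and plain weight magnitude stands in for the paper's own importance estimation and dynamic pattern generation. The kernel shape is unchanged after pruning, which is the sense in which such pruning preserves the architecture.

```python
# Hypothetical candidate patterns: each keeps 4 of the 9 positions in a 3x3 kernel.
PATTERNS = [
    ((0, 0), (0, 1), (1, 0), (1, 1)),
    ((0, 1), (0, 2), (1, 1), (1, 2)),
    ((1, 0), (1, 1), (2, 0), (2, 1)),
    ((1, 1), (1, 2), (2, 1), (2, 2)),
]


def pattern_score(kernel, pattern):
    """Importance proxy (an assumption): total magnitude of the kept weights."""
    return sum(abs(kernel[r][c]) for r, c in pattern)


def best_pattern(kernel):
    """Pick the candidate pattern that preserves the most weight magnitude."""
    return max(PATTERNS, key=lambda pattern: pattern_score(kernel, pattern))


def prune_kernel(kernel):
    """Zero every weight outside the selected pattern; the 3x3 shape is kept."""
    keep = set(best_pattern(kernel))
    return [[kernel[r][c] if (r, c) in keep else 0.0 for c in range(3)]
            for r in range(3)]


kernel = [
    [0.1, -0.9, 0.8],
    [0.0, 0.7, -0.6],
    [0.2, 0.1, 0.05],
]
pruned = prune_kernel(kernel)
```

Restricting every kernel to a small set of shared patterns is what gives a compiler the regularity it needs to generate efficient code for the pruned model.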

Citations: 11

Task-graph scheduling extensions for efficient synchronization and communication

ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing

Pub Date : 2020-11-06 DOI: 10.1145/3447818.3461616

Seonmyeong Bak, Oscar R. Hernandez, M. Gates, P. Luszczek, Vivek Sarkar

Task graphs have been studied for decades as a foundation for scheduling irregular parallel applications and incorporated in many programming models including OpenMP. While many high-performance parallel libraries are based on task graphs, they also have additional scheduling requirements, such as synchronization within inner levels of data parallelism and internal blocking communications. In this paper, we extend task-graph scheduling to support efficient synchronization and communication within tasks. Compared to past work, our scheduler avoids deadlock and oversubscription of worker threads, and refines victim selection to increase the overlap of sibling tasks. To the best of our knowledge, our approach is the first to combine gang-scheduling and work-stealing in a single runtime. Our approach has been evaluated on the SLATE high-performance linear algebra library. Relative to the LLVM OMP runtime, our runtime demonstrates performance improvements of up to 13.82%, 15.2%, and 36.94% for LU, QR, and Cholesky, respectively, evaluated across different configurations related to matrix size, number of nodes, and use of CPUs vs GPUs.
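For readers unfamiliar with work-stealing, the following is a deterministic, single-threaded simulation of the basic idea only. It does not model the paper's runtime, its gang-scheduling, or its refined victim selection. It assumes each worker owns a double-ended queue, takes its own newest task first, and, when idle, steals the oldest task from the first non-empty queue it finds. The function name and the task labels are hypothetical.

```python
from collections import deque


def run_work_stealing(queues):
    """Simulate rounds in which every worker executes at most one task.

    queues: list of deques of task labels, one per worker.
    Returns a per-worker log of the tasks each worker executed, in order.
    """
    log = [[] for _ in queues]
    while any(queues):
        for worker, own in enumerate(queues):
            if own:
                # The owner takes the newest task from its own queue.
                log[worker].append(own.pop())
            else:
                # An idle worker steals the oldest task from a victim.
                # Victim choice here is simply the first non-empty queue.
                victim = next((q for q in queues if q), None)
                if victim is not None:
                    log[worker].append(victim.popleft())
    return log


# Worker 0 starts with all the work; worker 1 starts idle.
result = run_work_stealing([deque(["a", "b", "c", "d"]), deque()])
```

Taking from opposite ends keeps owner and thief from contending for the same task, and in a real runtime the choice of victim influences how well sibling tasks overlap.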

Citations: 2