Pub Date: 2020-07-01 | DOI: 10.1109/DAC18072.2020.9218721
Christoph Scholl, Alexander Konrad
Over the last few years, Symbolic Computer Algebra (SCA) has delivered excellent results in the verification of large integer and finite field multipliers at the gate level. In contrast to those encouraging advances, SCA-based divider verification has remained in its infancy, awaiting a major breakthrough. In this paper we analyze the fundamental reasons that have so far prevented the success of SCA-based divider verification and present SAT Based Information Forwarding (SBIF). SBIF enhances SCA-based backward rewriting with information propagation in the opposite direction. We successfully apply the method to the fully automatic formal verification of large non-restoring dividers.
{"title":"Symbolic Computer Algebra and SAT Based Information Forwarding for Fully Automatic Divider Verification","authors":"Christoph Scholl, Alexander Konrad","doi":"10.1109/DAC18072.2020.9218721","DOIUrl":"https://doi.org/10.1109/DAC18072.2020.9218721","url":null,"abstract":"During the last few years Symbolic Computer Algebra (SCA) delivered excellent results in the verification of large integer and finite field multipliers at the gate level. In contrast to those encouraging advances, SCA-based divider verification has been still in its infancy and awaited a major breakthrough. In this paper we analyze the fundamental reasons that prevented the success for SCA-based divider verification so far and present SAT Based Information Forwarding (SBIF). SBIF enhances SCA-based backward rewriting by information propagation in the opposite direction. We successfully apply the method to the fully automatic formal verification of large non-restoring dividers.","PeriodicalId":428807,"journal":{"name":"2020 57th ACM/IEEE Design Automation Conference (DAC)","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122014891","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-07-01 | DOI: 10.1109/DAC18072.2020.9218489
Antonino Tumeo, Marco Minutoli, Vito Giovanni Castellana, J. Manzano, Vinay C. Amatya, D. Brooks, Gu-Yeon Wei
Next-generation systems, such as edge devices, will need to provide efficient processing of machine learning (ML) algorithms along several metrics, including energy, performance, area, and latency. However, the quickly evolving field of ML makes it extremely difficult to generate accelerators able to support a wide variety of algorithms. At the same time, designing accelerators in hardware description languages (HDLs) by hand is hard and time-consuming, and does not allow quick exploration of the design space. In this paper we present the Software Defined Accelerators From Learning Tools Environment (SODALITE), an automated open-source high-level ML-framework-to-Verilog compiler targeting ML Application-Specific Integrated Circuit (ASIC) chiplets. The SODALITE approach will implement optimal designs by seamlessly combining custom components generated through high-level synthesis (HLS) with templated and fully tunable Intellectual Properties (IPs) and macros, integrated in an extendable resource library. Through a closed-loop design space exploration engine, developers will be able to quickly explore their hardware designs along different dimensions.
{"title":"Invited: Software Defined Accelerators From Learning Tools Environment","authors":"Antonino Tumeo, Marco Minutoli, Vito Giovanni Castellana, J. Manzano, Vinay C. Amatya, D. Brooks, Gu-Yeon Wei","doi":"10.1109/DAC18072.2020.9218489","DOIUrl":"https://doi.org/10.1109/DAC18072.2020.9218489","url":null,"abstract":"Next generation systems, such as edge devices, will need to provide efficient processing of machine learning (ML) algorithms along several metrics, including energy, performance, area, and latency. However, the quickly evolving field of ML makes it extremely difficult to generate accelerators able to support a wide variety of algorithms. At the same time, designing accelerators in hardware description languages (HDLs) by hand is hard and time consuming, and does not allow quick exploration of the design space. In this paper we present the Software Defined Accelerators From Learning Tools Environment (SODALITE), an automated open source high-level ML framework-to-verilog compiler targeting ML Application-Specific Integrated Circuits (ASICs) chiplets. The SODALITE approach will implement optimal designs by seamlessly combining custom components generated through high-level synthesis (HLS) with templated and fully tunable Intellectual Properties (IPs) and macros, integrated in an extendable resource library. Through a closed loop design space exploration engine, developers will be able to quickly explore their hardware designs along different dimensions.","PeriodicalId":428807,"journal":{"name":"2020 57th ACM/IEEE Design Automation Conference (DAC)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122882004","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-07-01 | DOI: 10.1109/DAC18072.2020.9218534
Xiandong Zhao, Ying Wang, Cheng Liu, Cong Shi, Kaijie Tu, Lei Zhang
Bit-serial architectures (BSAs) are becoming increasingly popular in low-power neural network processor (NNP) design. However, the performance and efficiency of state-of-the-art BSA NNPs depend heavily on the distribution of ineffectual weight-bits in the running neural network. To boost the efficiency of third-party BSA accelerators, this work presents Bit-Pruner, a software approach to learning BSA-favored neural networks without resorting to hardware modifications. The techniques proposed in this work not only progressively prune but also structure the non-zero bits in weights, so that the number of zero-bits in the model can be increased and also load-balanced to suit the architecture of the target BSA accelerators. According to our experiments on a set of representative neural networks, Bit-Pruner increases the bit-sparsity up to 94.4% with negligible accuracy degradation. When the bit-pruned models are deployed onto typical BSA accelerators, the average performance is 2.1X and 1.5X higher than the baselines running non-pruned and weight-pruned networks, respectively.
{"title":"BitPruner: Network Pruning for Bit-serial Accelerators","authors":"Xiandong Zhao, Ying Wang, Cheng Liu, Cong Shi, Kaijie Tu, Lei Zhang","doi":"10.1109/DAC18072.2020.9218534","DOIUrl":"https://doi.org/10.1109/DAC18072.2020.9218534","url":null,"abstract":"Bit-serial architectures (BSAs) are becoming increasingly popular in low power neural network processor (NNP) design. However, the performance and efficiency of state-of-the-art BSA NNPs are heavily depending on the distribution of ineffectual weight-bits of the running neural network. To boost the efficiency of third-party BSA accelerators, this work presents Bit-Pruner, a software approach to learn BSA-favored neural networks without resorting to hardware modifications. The techniques proposed in this work not only progressively prune but also structure the non-zero bits in weights, so that the number of zero-bits in the model can be increased and also load-balanced to suit the architecture of the target BSA accelerators. According to our experiments on a set of representative neural networks, Bit-Pruner increases the bit-sparsity up to 94.4% with negligible accuracy degradation. When the bit-pruned models are deployed onto typical BSA accelerators, the average performance is 2.1X and 1.5X higher than the baselines running non-pruned and weight-pruned networks, respectively.","PeriodicalId":428807,"journal":{"name":"2020 57th ACM/IEEE Design Automation Conference (DAC)","volume":"98 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123816366","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-07-01 | DOI: 10.1109/DAC18072.2020.9218650
Dongyeob Shin, Geonho Kim, Joongho Jo, Jongsun Park
In deep neural network (DNN) training, network weights are iteratively updated with the weight gradients obtained from stochastic gradient descent (SGD). Since SGD inherently tolerates noise in the gradient calculations, approximating weight gradient computations has large potential for training energy/time savings without degrading accuracy. In this paper, we propose an input-dependent approximation of the weight gradient for improving the energy efficiency of the training process. Considering that the network's output prediction confidence changes with the training inputs, the relation between the confidence and the magnitude of the weight gradient can be efficiently exploited to skip gradient computations without an accuracy drop, especially for high-confidence inputs. Under a given squared-error constraint, the computation skip rate can also be controlled by changing the confidence threshold. Simulation results show that our approach can skip 72.6% of gradient computations for the CIFAR-100 dataset using ResNet-18 without accuracy degradation. A hardware implementation in a 65nm CMOS process shows that our design achieves maximum per-epoch training energy and time savings of 88.84% and 98.16%, respectively, for the CIFAR-100 dataset using ResNet-18 compared to a state-of-the-art training accelerator.
{"title":"Prediction Confidence based Low Complexity Gradient Computation for Accelerating DNN Training","authors":"Dongyeob Shin, Geonho Kim, Joongho Jo, Jongsun Park","doi":"10.1109/DAC18072.2020.9218650","DOIUrl":"https://doi.org/10.1109/DAC18072.2020.9218650","url":null,"abstract":"In deep neural network (DNN) training, network weights are iteratively updated with the weight gradients that are obtained from stochastic gradient descent (SGD). Since SGD inherently allows gradient calculations with noise, approximating weight gradient computations have a large potential of training energy/time savings without degrading accuracy. In this paper, we propose an input-dependent approximation of the weight gradient for improving energy efficiency of training process. Considering that the output predictions of network (confidence) changes with training inputs, the relation between the confidence and the magnitude of weight gradient can be efficiently exploited to skip the gradient computations without accuracy drop, especially for high confidence inputs. With a given squared error constraint, the computation skip rates can be also controlled by changing the confidence threshold. The simulation results show that our approach can skip 72.6% of gradient computations for CIFAR-100 dataset using ResNet-18 without accuracy degradation. Hardware implementation with 65nm CMOS process shows that our design achieves 88.84% and 98.16% of maximum per epoch training energy and time savings, respectively, for CIFAR-100 dataset using ResNet-18 compared to state-of-the-art training accelerator.","PeriodicalId":428807,"journal":{"name":"2020 57th ACM/IEEE Design Automation Conference (DAC)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129305058","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-07-01 | DOI: 10.1109/DAC18072.2020.9218612
Mohammad A. Alshboul, James Tuck, Yan Solihin
Future systems are expected to increasingly include Non-Volatile Main Memory (NVMM). However, due to the limited write endurance of NVMM, the number of writes must be reduced. While new architectures and algorithms have been proposed to reduce writes to NVMM, few or no studies have looked at the effect of compiler optimizations on writes. In this paper, we investigate the impact of one popular compiler optimization (loop tiling) on a very important computation kernel (matrix multiplication). Among our novel observations is that tiling for matrix multiplication causes 25× write amplification. Furthermore, we investigate techniques to make tiling more NVMM-friendly, by choosing the right tile size and employing hierarchical tiling. Our method, Write-Efficient Tiling (WET), adds a new outer tile designed to fit the write working set in the Last Level Cache (LLC) and thereby reduce the number of writes to NVMM. Our experiments reduce writes by 81% while simultaneously improving performance.
{"title":"WET: Write Efficient Loop Tiling for Non-Volatile Main Memory","authors":"Mohammad A. Alshboul, James Tuck, Yan Solihin","doi":"10.1109/DAC18072.2020.9218612","DOIUrl":"https://doi.org/10.1109/DAC18072.2020.9218612","url":null,"abstract":"Future systems are expected to increasingly include a Non-Volatile Main Memory (NVMM). However, due to the limited NVMM write endurance, the number of writes must be reduced. While new architectures and algorithms have been proposed to reduce writes to NVMM, few or no studies have looked at the effect of compiler optimizations on writes.In this paper, we investigate the impact of one popular compiler optimization (loop tiling) on a very important computation kernel (matrix multiplication). Our novel observation includes that tiling on matrix multiplication causes a 25× write amplification. Furthermore, we investigate techniques to make tilling more NVMM friendly, through choosing the right tile size and employing hierarchical tiling. Our method Write-Efficient Tiling (WET) adds a new outer tile designed for fitting the write working set to the Last Level Cache (LLC) to reduce the number of writes to NVMM. Our experiments reduce writes by 81% while simultaneously improve performance.","PeriodicalId":428807,"journal":{"name":"2020 57th ACM/IEEE Design Automation Conference (DAC)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131304667","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-07-01 | DOI: 10.1109/DAC18072.2020.9218664
Ardhi Wiratama Baskara Yudha, Reza Pulungan, H. Hoffmann, Yan Solihin
This paper presents a novel approach to accelerate applications running on integrated CPU-GPU systems. Many integrated CPU-GPU systems use cache-coherent shared memory to communicate. For example, after the CPU produces data for the GPU, the GPU may pull the data into its cache when it accesses the data. In such a pull-based approach, data resides in a shared cache until the GPU accesses it, resulting in long load latency on a first GPU access to a cache line. In this work, we propose a new, push-based coherence mechanism that explicitly exploits the CPU-GPU producer-consumer relationship by automatically moving data from the CPU to the GPU last-level cache. The proposed mechanism results in a dramatic reduction of the GPU L2 cache miss rate in general, and a consequent increase in overall performance. Our experiments show that the proposed scheme can increase performance by up to 37%, with typical improvements in the 5–7% range. We find that even when tested applications do not benefit from the proposed approach, their performance does not decrease with our technique. While we demonstrate how the proposed scheme can co-exist with traditional cache coherence mechanisms, we argue that it could also be used as a simpler replacement for existing protocols.
{"title":"A Simple Cache Coherence Scheme for Integrated CPU-GPU Systems","authors":"Ardhi Wiratama Baskara Yudha, Reza Pulungan, H. Hoffmann, Yan Solihin","doi":"10.1109/DAC18072.2020.9218664","DOIUrl":"https://doi.org/10.1109/DAC18072.2020.9218664","url":null,"abstract":"This paper presents a novel approach to accelerate applications running on integrated CPU-GPU systems. Many integrated CPU-GPU systems use cache-coherent shared memory to communicate. For example, after CPU produces data for GPU, the GPU may pull the data into its cache when it accesses the data. In such a pull-based approach, data resides in a shared cache until the GPU accesses it, resulting in long load latency on a first GPU access to a cache line. In this work, we propose a new, push-based, coherence mechanism that explicitly exploits the CPU and GPU producer-consumer relationship by automatically moving data from CPU to GPU last-level cache. The proposed mechanism results in a dramatic reduction of the GPU L2 cache miss rate in general, and a consequent increase in overall performance. Our experiments show that the proposed scheme can increase performance by up to 37%, with typical improvements in the 5–7% range. We find that even when tested applications do not benefit from the proposed approach, their performance does not decrease with our technique. While we demonstrate how the proposed scheme can co-exist with traditional cache coherence mechanisms, we argue that it could also be used as a simpler replacement for existing protocols.","PeriodicalId":428807,"journal":{"name":"2020 57th ACM/IEEE Design Automation Conference (DAC)","volume":"54 8","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120904137","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-07-01 | DOI: 10.1109/DAC18072.2020.9218633
Nuo Xu, Qi Liu, Tao Liu, Zihao Liu, Xiaochen Guo, Wujie Wen
Machine learning models have been widely deployed in many real-world tasks. When a non-expert data holder wants to use a third-party machine learning service for model training, it is critical to preserve the confidentiality of the training data. In this paper, we for the first time explore the potential privacy leakage in a scenario where a malicious ML provider offers the data holder customized training code that includes model compression, which is essential in practical deployment. The provider is unable to access the training process hosted by the secured third party, but can query the models once they are released in public. As a result, the adversary can extract high-quality sensitive training data even from deeply compressed models that are tailored for resource-limited devices. Our investigation shows that existing compression techniques, such as quantization, can serve as a defense against such an attack by degrading the model accuracy and the quality of the memorized data simultaneously. To overcome this defense, we take an initial attempt at designing a simple but stealthy quantized correlation encoding attack flow from an adversary's perspective. Three integrated components, data pre-processing, layer-wise data-weight correlation regularization, and data-aware quantization, are developed accordingly. Extensive experimental results show that our framework preserves the evasiveness and effectiveness of stealing data from compressed models.
{"title":"Stealing Your Data from Compressed Machine Learning Models","authors":"Nuo Xu, Qi Liu, Tao Liu, Zihao Liu, Xiaochen Guo, Wujie Wen","doi":"10.1109/DAC18072.2020.9218633","DOIUrl":"https://doi.org/10.1109/DAC18072.2020.9218633","url":null,"abstract":"Machine learning models have been widely deployed in many real-world tasks. When a non-expert data holder wants to use a third-party machine learning service for model training, it is critical to preserve the confidentiality of the training data. In this paper, we for the first time explore the potential privacy leakage in a scenario that a malicious ML provider offers data holder customized training code including model compression which is essential in practical deployment The provider is unable to access the training process hosted by the secured third party, but could inquire models when they are released in public. As a result, adversary can extract sensitive training data with high quality even from these deeply compressed models that are tailored for resource-limited devices. Our investigation shows that existing compressions like quantization, can serve as a defense against such an attack, by degrading the model accuracy and memorized data quality simultaneously. To overcome this defense, we take an initial attempt to design a simple but stealthy quantized correlation encoding attack flow from an adversary perspective. Three integrated components-data pre-processing, layer-wise data-weight correlation regularization, data-aware quantization, are developed accordingly. Extensive experimental results show that our framework can preserve the evasiveness and effectiveness of stealing data from compressed models.","PeriodicalId":428807,"journal":{"name":"2020 57th ACM/IEEE Design Automation Conference (DAC)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127039852","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-07-01 | DOI: 10.1109/DAC18072.2020.9218671
Louis K. Scheffer
Our technical ability to collect data about biological systems far outpaces our ability to understand them. Historically, for example, we have had complete and explicit genomes for almost two decades, but we still have no idea what many genes do. More recently a similar situation has arisen: we can reconstruct huge neural circuits, and/or watch them operate in the brain, but still don't know how they work. This talk covers this second and newer problem, understanding neural circuits. We introduce a variety of computational tools currently being used to attack this data-rich, understanding-poor problem. Examples include dimensionality reduction for nonlinear systems, looking for known and proposed circuits, and using machine learning for parameter estimation. One general theme is the use of biological priors to help fill in unknowns, see whether proposed solutions are feasible, and more generally aid understanding.
{"title":"INVITED: Computational Methods of Biological Exploration","authors":"Louis K. Scheffer","doi":"10.1109/DAC18072.2020.9218671","DOIUrl":"https://doi.org/10.1109/DAC18072.2020.9218671","url":null,"abstract":"Our technical ability to collect data about biological systems far outpaces our ability to understand them. Historically, for example, we have had complete and explicit genomes for almost two decades, but we still have no idea what many genes do. More recently a similar situation has arisen, where we can reconstruct huge neural circuits, and/or watch them operate in the brain, but still don’t know how they work. This talk covers this second and newer problem, understanding neural circuits. We introduce a variety of computational tools currently being used to attack this data-rich, understanding-poor problems. Examples include dimensionality reduction for nonlinear systems, looking for known and proposed circuits, and using machine learning for parameter estimation. One general theme is the use of biological priors, to help fill in unknowns, see if proposed solutions are feasible, and more generally aid understanding.","PeriodicalId":428807,"journal":{"name":"2020 57th ACM/IEEE Design Automation Conference (DAC)","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124078119","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-07-01 | DOI: 10.1109/DAC18072.2020.9218644
Maohua Zhu, Yuan Xie
Neural Networks (NNs) exhibit high redundancy in their parameters, so pruning methods can achieve high compression ratios without accuracy loss. However, the very high sparsity produced by unstructured pruning methods is difficult to map efficiently onto Graphics Processing Units (GPUs) because of its decoding overhead and workload imbalance. With the introduction of the Tensor Core, the latest GPUs achieve even higher throughput for dense neural networks. This makes unstructured sparse neural networks fail to outperform their dense counterparts, because they are not currently supported by the Tensor Core. To tackle this problem, prior work suggests structured pruning to improve the performance of sparse NNs on GPUs. However, such structured pruning methods have to sacrifice a significant part of the sparsity to retain model accuracy, which limits the speedup on the hardware. In this paper, we observe that the Tensor Core is also able to compute unstructured sparse NNs efficiently. To achieve this goal, we first propose ExTensor, a set of sparse Tensor Core instructions with a variable input matrix tile size. The variable tile size allows a matrix multiplication to be implemented by mixing different types of ExTensor instructions. We build a performance model to estimate the latency of an ExTensor instruction given a sparse weight matrix operand. Based on this model, we propose a heuristic algorithm to find the optimal sequence of instructions for an ExTensor-based kernel to achieve the best performance on the GPU. Experimental results demonstrate that our approach achieves 36% better performance than the state-of-the-art sparse Tensor Core design.
{"title":"Taming Unstructured Sparsity on GPUs via Latency-Aware Optimization","authors":"Maohua Zhu, Yuan Xie","doi":"10.1109/DAC18072.2020.9218644","DOIUrl":"https://doi.org/10.1109/DAC18072.2020.9218644","url":null,"abstract":"Neural Networks (NNs) exhibit high redundancy in their parameters so that pruning methods can achieve high compression ratio without accuracy loss. However, the very high sparsity produced by unstructured pruning methods is difficult to be efficiently mapped onto Graphics Processing Units (GPUs) because of its decoding overhead and workload imbalance. With the introduction of Tensor Core, the latest GPUs achieve even higher throughput for the dense neural networks. This makes unstructured neural networks fail to outperform their dense counterparts because they are not currently supported by Tensor Core. To tackle this problem, prior work suggests structured pruning to improve the performance of sparse NNs on GPUs. However, such structured pruning methods have to sacrifice a significant part of sparsity to retain the model accuracy, which limits the speedup on the hardware. In this paper, we observe that the Tensor Core is also able to compute unstructured sparse NNs efficiently. To achieve this goal, we first propose ExTensor, a set of sparse Tensor Core instructions with a variable input matrix tile size. The variable tile size allows a matrix multiplication to be implemented by mixing different types of ExTensor instructions. We build a performance model to estimate the latency of an ExTensor instruction given an operand sparse weight matrix. Based on this model, we propose a heuristic algorithm to find the optimal sequence of the instructions for an ExTensor based kernel to achieve the best performance on the GPU. Experimental results demonstrate that our approach achieves 36% better performance than the state-of-the-art sparse Tensor Core design.","PeriodicalId":428807,"journal":{"name":"2020 57th ACM/IEEE Design Automation Conference (DAC)","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127496304","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}