2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD): Latest Publications


AutoGTCO: Graph and Tensor Co-Optimize for Image Recognition with Transformers on GPU

2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)

Pub Date : 2021-11-01 DOI: 10.1109/ICCAD51958.2021.9643487

Yang Bai, Xufeng Yao, Qi Sun, Bei Yu

Performance optimization is the art of continuously seeking an effective mapping between algorithm and hardware. Existing deep learning compilers and frameworks optimize the computation graph by applying transformations manually designed by experts. We argue that these methods ignore some possible graph-level optimizations and are therefore difficult to generalize to emerging deep learning models or new operators. In this work, we propose AutoGTCO, a tensor program generation system for vision tasks with the transformer architecture on GPU. Compared with existing fusion strategies, AutoGTCO explores the optimization of operator fusion in the transformer model through a novel dynamic programming algorithm. Specifically, to construct an effective search space of the sampled programs, new sketch generation rules and a search policy are proposed for the batch matrix multiplication and softmax operators in each subgraph. These fuse the operators into large computation units, which AutoGTCO can then map and transform into efficient CUDA kernels. Overall, our evaluation on three real-world transformer-based vision tasks shows that AutoGTCO improves execution performance relative to the deep learning engine TensorRT by up to 1.38×.
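To make the fusion target concrete, here is a minimal pure-Python sketch of the operator pattern the abstract names: a batch matrix multiplication followed by a row-wise softmax. It is not AutoGTCO's generated code or its search algorithm. The function names `unfused` and `fused` and the tiny inputs are assumptions chosen only for illustration. The unfused version materializes the full score tensor before normalizing it, while the fused version normalizes each row as soon as it is produced, which is the kind of intermediate traffic that operator fusion aims to avoid.

```python
import math


def softmax(row):
    """Numerically stable softmax of one row."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def unfused(q, k):
    """Two passes: the batch matmul stores all scores, then softmax reads them back."""
    scores = [[[sum(a * b for a, b in zip(qi, kj)) for kj in kb] for qi in qb]
              for qb, kb in zip(q, k)]
    return [[softmax(row) for row in batch] for batch in scores]


def fused(q, k):
    """One pass: each row of scores is normalized as soon as it is produced."""
    return [[softmax([sum(a * b for a, b in zip(qi, kj)) for kj in kb]) for qi in qb]
            for qb, kb in zip(q, k)]


# Hypothetical inputs: one batch, two query rows, two key rows, feature size 2.
q = [[[1.0, 0.0], [0.0, 1.0]]]
k = [[[1.0, 0.0], [1.0, 1.0]]]
print(fused(q, k) == unfused(q, k))  # both orderings give the same result
```

On a GPU the benefit would come from keeping the intermediate scores in registers or shared memory instead of writing them to global memory; the Python version only shows that the two orderings compute the same values.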



Citations: 7

DALTA: A Decomposition-based Approximate Lookup Table Architecture

2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)

Pub Date : 2021-11-01 DOI: 10.1109/ICCAD51958.2021.9643562

Chang Meng, Z. Xiang, Niyiqiu Liu, Yixuan Hu, Jiahao Song, Runsheng Wang, Ru Huang, Weikang Qian

A popular way to implement an arithmetic function is through a lookup table (LUT), which stores the pre-computed outputs for all the inputs. However, its size grows exponentially with the number of input bits. In this work, targeting the computing kernels of error-tolerant applications, we propose DALTA, a reconfigurable decomposition-based approximate lookup table architecture, to approximately implement those kernels with dramatically reduced size. We also propose integer linear programming-based approximate decomposition methods to map a given function to the architecture. Our architecture features low energy consumption and high speed. The experimental results show that our architecture achieves energy and latency savings of 56.5% and 92.4%, respectively, over the state-of-the-art approximate LUT architecture.
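The exponential growth mentioned in the abstract can be illustrated with a short sketch. It only shows a plain, exact LUT and its entry count; it does not reproduce DALTA's decomposition or its approximation method. The helper names `build_lut` and `lut_entries` and the squaring function are assumptions made for this example.

```python
def build_lut(func, n_bits):
    """Precompute func for every n_bits-wide input; the table has 2**n_bits entries."""
    return [func(x) for x in range(1 << n_bits)]


def lut_entries(n_bits):
    """Number of entries a full LUT needs for an n_bits-wide input."""
    return 1 << n_bits


# Hypothetical example function: the square of a 4-bit input.
square_lut = build_lut(lambda x: x * x, 4)
print(len(square_lut), square_lut[5])  # 16 25

# The table doubles in size with every additional input bit.
for n in (4, 8, 16):
    print(n, lut_entries(n))
```

Going from 4 to 16 input bits raises the entry count from 16 to 65,536, which is why a full table quickly becomes impractical for wider inputs.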


Citations: 1

ICCAD Special Session Paper: Quantum Variational Methods for Quantum Applications

2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)

Pub Date : 2021-11-01 DOI: 10.1109/ICCAD51958.2021.9643519

Shouvanik Chakrabarti, Xuchen You, Xiaodi Wu

Quantum variational methods are promising near-term applications of quantum machines, not only because of their potential advantages in solving certain computational tasks and understanding quantum physics but also because of their feasibility on near-term quantum machines. However, many challenges remain before the full potential of quantum variational methods can be unleashed, especially in the design of efficient training methods for each domain-specific quantum variational ansatz. This paper proposes a theory-guided principle to tackle the training issue of quantum variational methods and highlights some successful examples.


Citations: 0

ParaMitE: Mitigating Parasitic CNFETs in the Presence of Unetched CNTs

2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)

Pub Date : 2021-11-01 DOI: 10.1109/ICCAD51958.2021.9643513

Sanmitra Banerjee, Arjun Chaudhuri, Jinwoo Kim, Gauthaman Murali, M. Nelson, S. Lim, K. Chakrabarty

Carbon nanotube FETs (CNFETs) are emerging as an alternative to silicon devices for next-generation computing systems. However, imperfect carbon nanotube deposition during CNFET fabrication can lead to the formation of difficult-to-etch CNT aggregates in the active layer. These CNT aggregates can form parasitic CNFETs (para-FETs) that are modulated by adjoining gate contacts or back-end-of-line metal layers, thereby forming conditional shorts and stuck-at faults. We show that even weak (parametric) para-FETs can lead to a degraded static noise margin in CNFET-based design. We propose ParaMitE, a layout optimization method that horizontally flips selected standard cells in situ to minimize the number of para-FETs that can arise due to unetched CNTs. As we modify only the cell orientation (and not the cell placement), the impact on the power, timing, and wire length of the CNFET-based design is negligible. Simulation results for several benchmarks show that the proposed method can mitigate up to 60% of the possible para-FET locations (90% of the most critical locations) with only a 3% increase in the total wire length. ParaMitE can enable yield ramp-up at the foundry by providing guidance on which para-FETs can be avoided by design, and conversely, which CNT aggregates must be removed through processing steps.



Citations: 0

On-chip Optical Routing with Waveguide Matching Constraints

2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)

Pub Date : 2021-11-01 DOI: 10.1109/iccad51958.2021.9643560

Fu-Yu Chuang, Yao-Wen Chang

Photonic integrated circuits (PICs), which introduce optical interconnections for on-chip communication, have become one of the most promising solutions to the increasing demand for large bandwidth and low power consumption. Routing techniques for optical interconnections have been proposed to deal with various routing issues in PICs, including transmission losses, thermal reliability, etc. However, in some emerging applications, different optical paths should be closely matched (in terms of the path length, the number of bends, the radius of curvature of bends, and the crossing count) to operate correctly. To the best of our knowledge, no previous work deals with these matching constraints in optical routing. This paper proposes a complete algorithm flow based on an optimal Steiner tree construction and integer linear programming with a hexagonal routing style to handle the matching constraints while minimizing the total transmission loss in a design. Compared with A*-search-based net-matching routing, experimental results show that our optical router can route all nets without violating any matching constraints while achieving lower total/maximum transmission loss, based on the optical netlists from a state-of-the-art work.



Citations: 0

A Novel Clock Tree Aware Placement Methodology for Single Flux Quantum (SFQ) Logic Circuits

2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)

Pub Date : 2021-11-01 DOI: 10.1109/ICCAD51958.2021.9643507

Ching-Cheng Wang, Wai-Kei Mak

In a single-flux-quantum (SFQ) circuit, almost all cells need to receive the clock signal, which incurs a high clock routing overhead. Besides, the clock tree of an SFQ circuit requires the insertion of a clock splitter cell at every tree branching point, which renders the conventional design flow of placement followed by clock tree synthesis ineffective at obtaining a high-quality clock tree with low clock skew. To address these issues, we propose a two-stage global placement methodology and a placement refinement algorithm after placement legalization. Our two-stage global placement methodology first applies a conventional global placement algorithm to place the cells in the given SFQ circuit evenly, which is followed by clock tree synthesis and clock splitter insertion, and then performs a second stage of global placement to re-place both the original cells and clock splitters at the same time. In the second global placement stage, the look-ahead legalization technique is used to spread out the original cells and the clock splitters, and the clock tree is re-synthesized several times to obtain an optimized clock tree topology such that there is little overlap between the clock splitters and the original circuit cells. In addition, the total wirelength of data signals and clock signal is optimized concurrently. After legalizing the placement of all cells, our placement refinement method can be run to further reduce the clock skew. Compared with the previous state-of-the-art work, on average we can reduce the total half-perimeter wirelength and clock skew by 9% and 31%, respectively.
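The half-perimeter wirelength (HPWL) metric reported in the abstract is a standard placement estimate: for each net, take the bounding box of its pins and add the box's width and height. The sketch below shows that computation only, not the paper's placement algorithm. The function names and the pin coordinates are hypothetical.

```python
def hpwl(pins):
    """Half-perimeter wirelength of one net: width plus height of the pins' bounding box."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def total_hpwl(nets):
    """Sum of HPWL over all nets; each net is a list of (x, y) pin coordinates."""
    return sum(hpwl(pins) for pins in nets)


# Hypothetical placement with two nets.
nets = [
    [(0, 0), (3, 4), (1, 2)],  # bounding box 3 wide, 4 tall: HPWL 7
    [(2, 2), (5, 2)],          # bounding box 3 wide, 0 tall: HPWL 3
]
print(total_hpwl(nets))  # 10
```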



Citations: 2

Toward Security Closure in the Face of Reliability Effects ICCAD Special Session Paper

2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)

Pub Date : 2021-11-01 DOI: 10.1109/ICCAD51958.2021.9643447

J. Lienig, Susann Rothe, Matthias Thiele, N. Rangarajan, M. Ashraf, M. Nabeel, H. Amrouch, O. Sinanoglu, J. Knechtel

The reliable operation of ICs is subject to physical effects like electromigration, thermal and stress migration, negative bias temperature instability, hot-carrier injection, etc. While these effects have been studied thoroughly for IC design, threats of their subtle exploitation are not captured well yet. In this paper, we open up a path for security closure of physical layouts in the face of reliability effects. Toward that end, we first review migration effects in interconnects and aging effects in transistors, along with established and emerging means for handling these effects during IC design. Next, we study security threats arising from these effects; in particular, we cover migration effects-based, disruptive Trojans and aging-exacerbated side-channel leakage. Finally, we outline corresponding strategies for security closure of physical layouts, along with an outline for CAD frameworks.


Citations: 4

Bit-Transformer: Transforming Bit-level Sparsity into Higher Preformance in ReRAM-based Accelerator

2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)

Pub Date : 2021-11-01 DOI: 10.1109/ICCAD51958.2021.9643569

Fangxin Liu, Wenbo Zhao, Zhezhi He, Zongwu Wang, Yilong Zhao, Yongbiao Chen, Li Jiang

Resistive Random-Access-Memory (ReRAM) crossbar is one of the most promising neural network accelerators, thanks to its in-memory and in-situ analog computing abilities for Matrix Multiplication-and-Accumulations (MACs). Nevertheless, the number of rows and columns of ReRAM cells for concurrent execution of MACs is constrained, resulting in limited in-memory computing throughput. Moreover, it is challenging to deploy Deep Neural Network (DNN) models with large model size in the crossbar, since the sparsity of DNNs cannot be effectively exploited in the crossbar structure. As a countermeasure, we develop a novel ReRAM-based DNN accelerator, named Bit-Transformer, which pays attention to the correlation between the bit-level sparsity and the performance of the ReRAM-based crossbar. We propose a superior bit-flip scheme combined with exponent-based quantization, which can adaptively flip the bits of the mapped DNNs to release redundant space without sacrificing much accuracy or incurring much hardware overhead. Meanwhile, we design an architecture that can integrate these techniques to massively shrink the crossbar footprint to be used. In this way, it efficiently leverages the bit-level sparsity for performance gains while reducing the energy consumption of computation. Comprehensive experiments indicate that our Bit-Transformer outperforms prior state-of-the-art designs by up to 13×, 35×, and 67× in terms of energy efficiency, area efficiency, and throughput, respectively. Code will be open-source in the camera-ready version.



Citations: 12

Optimizing VLSI Implementation with Reinforcement Learning - ICCAD Special Session Paper

2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)

Pub Date : 2021-11-01 DOI: 10.1109/ICCAD51958.2021.9643589

Haoxing Ren, Saad Godil, Brucek Khailany, Robert Kirby, Haiguang Liao, S. Nath, Jonathan Raiman, Rajarshi Roy

Reinforcement learning (RL) has gained attention recently as an optimization algorithm for chip design. This method treats many chip design problems as Markov decision problems (MDPs), where design optimization objectives are converted into rewards given by the environment and design variables are converted into actions provided to the environment. Some recent examples include applications of RL to macro placement and standard cell layout routing. We believe RL can be applied to nearly all aspects of VLSI implementation flows, since many VLSI implementation problems are NP-complete and state-of-the-art algorithms cannot be guaranteed to be optimal. With enough training data, it is possible to achieve better results with RL. In this paper, we review recent advances in applying RL to VLSI implementation problems such as cell layout, synthesis, placement, routing, and parameter tuning. We discuss the challenges of applying RL to VLSI implementation flows and propose future research directions for overcoming these challenges.
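The MDP framing in the abstract, where design variables become actions and the optimization objective becomes a reward, can be sketched with a toy environment. Everything here is an assumption for illustration: the class `ToyPlacementEnv`, the 1-D row of slots, and the wirelength-based reward are hypothetical and not taken from the paper. A simple greedy rule stands in for the learned policy, so no actual reinforcement learning is performed.

```python
class ToyPlacementEnv:
    """Toy MDP: place cells one by one into slots on a 1-D row.

    State  : slots chosen so far (cell i sits in placed[i]).
    Action : index of a free slot for the next cell (the design variable).
    Reward : negative wirelength added by the new cell (the design objective).
    """

    def __init__(self, n_cells, n_slots, edges):
        self.n_cells = n_cells
        self.n_slots = n_slots  # assumed to be at least n_cells
        self.edges = edges      # pairs (i, j) of connected cells with i < j
        self.placed = []

    def reset(self):
        self.placed = []
        return tuple(self.placed)

    def legal_actions(self):
        return [s for s in range(self.n_slots) if s not in self.placed]

    def cost_of(self, slot):
        """Wirelength from the next cell, if put in slot, to its already-placed neighbours."""
        cell = len(self.placed)
        return sum(abs(slot - self.placed[i]) for i, j in self.edges if j == cell)

    def step(self, slot):
        reward = -self.cost_of(slot)
        self.placed.append(slot)
        done = len(self.placed) == self.n_cells
        return tuple(self.placed), reward, done


def run_greedy_episode(env):
    """Roll out one episode with a greedy stand-in policy (not a trained RL agent)."""
    env.reset()
    total, done = 0, False
    while not done:
        action = min(env.legal_actions(), key=env.cost_of)
        _, reward, done = env.step(action)
        total += reward
    return list(env.placed), total


env = ToyPlacementEnv(n_cells=3, n_slots=4, edges=[(0, 1), (1, 2)])
placement, total_reward = run_greedy_episode(env)
print(placement, total_reward)  # [0, 1, 2] -2
```

An RL agent would replace the greedy rule with a policy trained to maximize the cumulative reward over many episodes; the `reset` and `step` interface is where such an agent would plug in.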



Citations: 3

dCSR: A Memory-Efficient Sparse Matrix Representation for Parallel Neural Network Inference

2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)

Pub Date : 2021-11-01 DOI: 10.1109/ICCAD51958.2021.9643506

E. Trommer, Bernd Waschneck, Akash Kumar

Reducing the memory footprint of neural networks is a crucial prerequisite for deploying them in small and low-cost embedded devices. Network parameters can often be reduced significantly through pruning. We discuss how to best represent the indexing overhead of sparse networks for the coming generation of Single Instruction, Multiple Data (SIMD)-capable microcontrollers. From this, we develop Delta-Compressed Storage Row (dCSR), a storage format for sparse matrices that allows for both low overhead storage and fast inference on embedded systems with wide SIMD units. We demonstrate our method on an ARM Cortex-M55 MCU prototype with M-Profile Vector Extension (MVE). A comparison of memory consumption and throughput shows that our method achieves competitive compression ratios and increases throughput over dense methods by up to $2.9\times$ for sparse matrix-vector multiplication (SpMV)-based kernels and $1.06\times$ for sparse matrix-matrix multiplication (SpMM). This is accomplished through handling the generation of index information directly in the SIMD unit, leading to an increase in effective memory bandwidth.
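As a rough illustration of the general idea behind delta-compressed indices, the sketch below stores each nonzero's column as the gap from the previous nonzero in the same row, then rebuilds the columns on the fly during a sparse matrix-vector product. This is a generic delta-encoded CSR written under stated assumptions, not the dCSR bit layout or the SIMD kernels from the paper; the function names and the example matrix are hypothetical.

```python
def to_delta_csr(dense):
    """Convert a dense matrix (list of rows) to CSR with delta-encoded column indices."""
    values, deltas, row_ptr = [], [], [0]
    for row in dense:
        prev_col = 0
        for col, v in enumerate(row):
            if v != 0:
                values.append(v)
                deltas.append(col - prev_col)  # gap since the previous nonzero in this row
                prev_col = col
        row_ptr.append(len(values))
    return values, deltas, row_ptr


def spmv(values, deltas, row_ptr, x):
    """Sparse matrix-vector product that rebuilds column indices from the deltas."""
    y = []
    for r in range(len(row_ptr) - 1):
        col, acc = 0, 0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            col += deltas[k]  # running sum of deltas recovers the absolute column
            acc += values[k] * x[col]
        y.append(acc)
    return y


dense = [
    [0, 0, 3, 0, 0, 1],
    [0, 0, 0, 0, 0, 0],
    [2, 0, 0, 0, 4, 0],
]
values, deltas, row_ptr = to_delta_csr(dense)
print(values)   # [3, 1, 2, 4]
print(deltas)   # [2, 3, 0, 4]
print(row_ptr)  # [0, 2, 2, 4]
print(spmv(values, deltas, row_ptr, [1, 1, 1, 1, 1, 1]))  # [4, 0, 6]
```

Because the gaps between nonzeros are usually much smaller than absolute column indices, they can be stored in fewer bits, which is where the memory saving of a delta-based format comes from.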



Citations: 3
