
Latest publications: 2017 IEEE/ACM International Conference on Computer-Aided Design (ICCAD)

Towards warp-scheduler friendly STT-RAM/SRAM hybrid GPGPU register file design
Pub Date : 2017-11-13 DOI: 10.1109/ICCAD.2017.8203850
Quan Deng, Youtao Zhang, Minxuan Zhang, Jun Yang
Modern Graphics Processing Units (GPUs) widely adopt a large SRAM-based register file (RF) to enable fast context switching. A large SRAM RF may consume 20% to 40% of GPU power, which has become one of the major design challenges for GPUs. Recent studies mitigate the issue through hybrid RF designs that architect a large STT-RAM (Spin Transfer Torque Magnetic memory) RF alongside a small SRAM buffer. However, the long STT-RAM write latency throttles data exchange between STT-RAM and SRAM, which penalizes warp schedulers that trigger frequent context switches, e.g., the round-robin scheduler. In this paper, we propose HC-RF, a warp-scheduler friendly hybrid RF design using a novel SRAM/STT-RAM hybrid cell (HC) structure. HC-RF exploits cell-level integration to improve the effective bandwidth between STT-RAM and SRAM. By enabling silent data transfer from SRAM to STT-RAM without blocking RF banks, HC-RF supports concurrent context switching and decouples performance from the choice of warp scheduler. Our experimental results show that, on average, HC-RF achieves a 50% performance improvement and a 44% energy consumption reduction over the coarse-grained hybrid design when adopting the LRR (Loose Round Robin) warp scheduler.
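As context for the scheduler discussion above, the round-robin policy the paper names is easy to sketch. The following toy LRR-style issue loop is a Python analogy only, not the paper's hardware; the stall predicate is an assumed stand-in for warps blocked on, e.g., a long STT-RAM write:

```python
from collections import deque

def lrr_schedule(num_warps, cycles, stalled=lambda warp, cycle: False):
    """Toy loose round-robin (LRR) warp scheduler: each cycle, issue the
    next ready warp in round-robin order, skipping warps reported as
    stalled (an assumed stand-in for blocked RF banks)."""
    order = deque(range(num_warps))
    issued = []
    for t in range(cycles):
        for _ in range(num_warps):
            w = order[0]
            order.rotate(-1)          # advance the round-robin pointer
            if not stalled(w, t):
                issued.append(w)
                break
        else:
            issued.append(None)       # every warp stalled this cycle
    return issued
```

With no stalls this degenerates to strict round robin; a persistently stalled warp is simply skipped each cycle, which is the behavior a slow backing store turns into lost issue slots.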
Citations: 5
SAT-based compilation to a non-vonNeumann processor
Pub Date : 2017-11-13 DOI: 10.1109/ICCAD.2017.8203842
S. Chaudhuri, A. Hetzel
This paper describes a compilation technique used to accelerate dataflow computations, common in deep neural network computing, on Coarse Grained Reconfigurable Array (CGRA) architectures. The technique has been demonstrated to automatically compile dataflow programs onto a commercial massively parallel CGRA-based dataflow processor (DPU) containing 16,000 processing elements. The DPU architecture overcomes the von Neumann bottleneck by spatially flowing and reusing data from local memories, and provides higher computation efficiency than temporal parallel architectures such as GPUs and multi-core CPUs. However, existing software development tools for CGRAs are limited to compiling domain-specific programs onto processing elements with uniform structures, and are not effective on complex microarchitectures where memory-access latencies vary in a nontrivial fashion depending on data locality. A primary contribution of this paper is a general algorithm that can compile general dataflow graphs and efficiently utilize processing elements with rich micro-architectural features such as complex instructions, multi-precision data paths, local memories, register files, and switches. Another contribution is an innovative application of Boolean Satisfiability to formally solve this complex, irregular optimization problem and produce high-quality results comparable to hand-written assembly code produced by human experts. A third contribution is an adaptive windowing algorithm that harnesses the complexity of the SAT-based approach and delivers a scalable and robust solution.
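The paper's SAT encoding is not reproduced here. As a hedged illustration of the kind of constraint involved, the sketch below brute-forces op-to-PE assignments where conflicting ops (e.g., ops contending for the same local-memory port, an assumed conflict model) must not share a processing element; a real flow would hand such constraints to a SAT solver instead of enumerating:

```python
import itertools

def map_dataflow(ops, pes, conflicts):
    """Exhaustive search over op -> PE assignments, standing in for a SAT
    solver: find an assignment in which no two conflicting ops share a
    processing element. Returns a dict op -> pe, or None if unsatisfiable."""
    for assign in itertools.product(pes, repeat=len(ops)):
        mapping = dict(zip(ops, assign))
        if all(mapping[a] != mapping[b] for a, b in conflicts):
            return mapping
    return None
```

The enumeration is exponential in the number of ops, which is exactly why a SAT formulation (and the paper's adaptive windowing on top of it) is needed at realistic scale.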
Citations: 13
Dedicated synthesis for MZI-based optical circuits based on AND-inverter graphs
Pub Date : 2017-11-13 DOI: 10.1109/ICCAD.2017.8203783
Arighna Deb, R. Wille, R. Drechsler
Optical circuits have received significant interest as a promising alternative to existing electronic systems. Consequently, the synthesis of optical circuits is also receiving increasing attention. However, initial solutions for the synthesis of optical circuits either rely on manual design or on rather straightforward mappings from established data structures such as BDDs, SoPs/ESoPs, etc. to the corresponding optical netlist. These approaches hardly utilize the full potential of the gate libraries available in this domain. In this paper, we propose an alternative synthesis solution based on AND-Inverter Graphs (AIGs) which is capable of exploiting this potential. That is, a scheme is presented which maps the given function representation to the desired circuit in a dedicated, one-to-one fashion, yielding significantly smaller circuit sizes. Experimental evaluations confirm that the proposed solution generates optical circuits with up to 97% fewer gates compared to existing synthesis approaches.
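For readers unfamiliar with the starting representation, a minimal AND-inverter graph can be sketched in a few lines: every function is built from 2-input ANDs and inversions, OR is derived via De Morgan's law, and structural hashing merges duplicate AND nodes. This is a generic AIG sketch, not the authors' mapping to MZI-based optical gates:

```python
class AIG:
    """Minimal AND-inverter graph: a literal is a (node, inverted) pair,
    and structural hashing merges duplicate AND nodes."""
    def __init__(self, num_inputs):
        self.num_inputs = num_inputs        # nodes 1..num_inputs are inputs
        self.strash = {}                    # (lit, lit) -> AND node id
        self.fanins = {}                    # AND node id -> (lit, lit)

    def AND(self, a, b):
        key = tuple(sorted((a, b)))         # canonical order for hashing
        if key not in self.strash:
            nid = self.num_inputs + len(self.strash) + 1
            self.strash[key] = nid
            self.fanins[nid] = key
        return (self.strash[key], False)

    def NOT(self, a):
        return (a[0], not a[1])             # inversion is free: flip the flag

    def OR(self, a, b):                     # De Morgan: a|b = ~(~a & ~b)
        return self.NOT(self.AND(self.NOT(a), self.NOT(b)))

    def eval(self, lit, inputs):
        node, inv = lit
        if node <= self.num_inputs:
            v = bool(inputs[node - 1])
        else:
            fa, fb = self.fanins[node]
            v = self.eval(fa, inputs) and self.eval(fb, inputs)
        return v ^ inv
```

Because inversions live on edges rather than as gates, the AND-node count is the natural cost metric that a dedicated one-to-one mapping can then carry over to the target gate library.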
Citations: 4
ApproxLUT: A novel approximate lookup table-based accelerator
Pub Date : 2017-11-13 DOI: 10.1109/ICCAD.2017.8203810
Ye Tian, Ting Wang, Qian Zhang, Q. Xu
Computing with memory, which stores the function responses of some input patterns in lookup tables offline and retrieves their values when encountering similar patterns (instead of performing online calculation), is a promising energy-efficient computing technique. Naturally, for a given lookup table size, the efficiency of this technique depends on which function responses are stored and how they are organized. In this paper, we propose a novel adaptive approximate lookup table based accelerator, wherein we store function responses in a hierarchical manner with increasingly fine granularity and accuracy. In addition, the proposed accelerator provides lightweight compensation of output results at different precision levels according to input patterns and output quality requirements. Moreover, our accelerator conducts adaptive lookup table search by exploiting input locality. Experimental results on various computation kernels show significant energy savings of the proposed accelerator over prior solutions.
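The offline-table-plus-lightweight-compensation idea can be illustrated with a generic software sketch (table size, target function, and the linear form of the compensation are assumptions for illustration, not details from the paper):

```python
import math

def build_lut(fn, lo, hi, entries):
    """Offline step: sample fn at uniformly spaced points."""
    step = (hi - lo) / (entries - 1)
    return [fn(lo + i * step) for i in range(entries)], lo, step

def lut_eval(lut, x, compensate=True):
    """Online step: nearest-entry lookup, optionally refined by a
    lightweight linear compensation between the two surrounding entries."""
    table, lo, step = lut
    pos = (x - lo) / step
    if not compensate:
        return table[min(round(pos), len(table) - 1)]
    i = min(int(pos), len(table) - 2)
    frac = pos - i
    return table[i] + frac * (table[i + 1] - table[i])
```

The compensation step trades a multiply-add for a much smaller table at the same output quality, which is the kind of precision/storage knob the accelerator exposes per input pattern.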
Citations: 12
State retention for power gated design with non-uniform multi-bit retention latches
Pub Date : 2017-11-13 DOI: 10.1109/ICCAD.2017.8203833
Guo-Gin Fan, Mark Po-Hung Lin
Retention registers/latches are commonly applied in power-gated circuits for state retention during sleep mode. Recent studies have shown that applying uniform multi-bit retention registers (MBRRs) can reduce the storage size, and hence save more chip area and leakage power, compared with single-bit retention registers. In this paper, a new formulation of power-gated circuit optimization with non-uniform MBRRs is studied to achieve even greater storage savings and higher storage utilization. An ILP-based approach is proposed to effectively explore different combinations of non-uniform MBRR replacement. Experimental results show that the proposed approach can reduce storage size by 36% compared with state-of-the-art uniform MBRR replacement, while achieving 100% storage utilization.
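The paper solves the real problem with an ILP. As a simplified stand-in, a one-dimensional dynamic program over assumed MBRR sizes and area costs shows why mixing sizes can cover a bit budget exactly (100% utilization) at minimum area; the sizes and per-size areas below are illustrative, not the paper's library data:

```python
def plan_retention(bits, sizes_area):
    """DP stand-in for the paper's ILP: pick counts of MBRR sizes that
    cover exactly `bits` retained state bits at minimum total area.
    Returns ({size: count}, area) or None if no exact cover exists."""
    INF = float('inf')
    best = [0.0] + [INF] * bits   # best[b] = min area covering b bits exactly
    choice = [None] * (bits + 1)
    for b in range(1, bits + 1):
        for size, area in sizes_area.items():
            if size <= b and best[b - size] + area < best[b]:
                best[b] = best[b - size] + area
                choice[b] = size
    if best[bits] == INF:
        return None
    counts, b = {}, bits
    while b:                      # walk back the choices
        counts[choice[b]] = counts.get(choice[b], 0) + 1
        b -= choice[b]
    return counts, best[bits]
```

The full problem also has to respect placement and multi-bit grouping constraints between the original registers, which is what pushes the authors from this kind of counting argument to an ILP.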
Citations: 6
Approximate image storage with multi-level cell STT-MRAM main memory
Pub Date : 2017-11-13 DOI: 10.1109/ICCAD.2017.8203788
Hengyu Zhao, Linuo Xue, Ping Chi, Jishen Zhao
Images consume significant storage space in both consumer devices and the cloud. As such, image processing applications incur high energy consumption when loading and accessing image data in memory. Fortunately, most image processing applications can tolerate approximate image data storage. In addition, multi-level cell spin-transfer torque MRAM (STT-MRAM) offers unique design opportunities as an image memory: the two bits in a memory cell require asymmetric write currents, with the soft bit requiring much less write current than the hard bit. This paper proposes an approximate image processing scheme that improves system energy efficiency without violating the image quality requirements of applications. Our design consists of (i) an approximate image storage mechanism that strives to write only the soft bits in MLC STT-MRAM main memory with small write current and (ii) a memory mode controller that determines the approximation of image data and coordinates across precise/approximate memory access modes. Our experimental results with various image processing functionalities demonstrate that our design reduces memory access energy consumption by 53% and 2.3x, with 100% user satisfaction, compared with traditional DRAM-based and MLC phase-change-memory-based main memory, respectively.
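A toy model of the soft-bit-only write and the precise/approximate mode decision follows. The bit-to-cell layout (low nibble = soft bits) and the relative energy numbers are assumptions for illustration only, not the paper's measured values:

```python
SOFT_WRITE_E = 1.0   # assumed relative write energies: the soft bit of an
HARD_WRITE_E = 4.0   # MLC cell is much cheaper to write than the hard bit

def write_pixel(new, old, approximate, soft_mask=0x0F):
    """Toy memory-mode controller for one 8-bit pixel: an approximate
    write updates only the cheap soft bits (positions under soft_mask);
    hard bits keep whatever the cells previously held. A precise write
    updates everything at full cost. Returns (stored_value, energy)."""
    if approximate:
        stored = (old & ~soft_mask & 0xFF) | (new & soft_mask)
        return stored, SOFT_WRITE_E
    return new, SOFT_WRITE_E + HARD_WRITE_E
```

The second test case below shows the failure mode the mode controller exists to manage: an approximate write is only accurate when the stale hard bits happen to match the new value, so quality-sensitive data must be routed to precise mode.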
Citations: 23
FPGA placement and routing
Pub Date : 2017-11-13 DOI: 10.1109/ICCAD.2017.8203878
Shih-Chun Chen, Yao-Wen Chang
FPGAs have emerged as a popular style for modern circuit designs, due mainly to their low non-recurring costs, in-field reprogrammability, short turn-around time, etc. A modern FPGA consists of an array of heterogeneous logic components, surrounded by routing resources and bounded by I/O cells. Compared to an ASIC, an FPGA has more limited logic and routing resources, diverse architectures, strict design constraints, etc.; as a result, FPGA placement and routing problems become much more challenging. Furthermore, with growing complexity, diverse design objectives, high heterogeneity, and evolving technologies, modern FPGA placement and routing bring up many emerging research opportunities. In this paper, we introduce basic FPGA architectures, describe the placement and routing problems for FPGAs, and explain key techniques to solve them (including three major placement paradigms: partitioning, simulated annealing, and analytical placement; two routing paradigms: sequential and concurrent routing; and simultaneous placement and routing). Finally, we provide some future research directions for FPGA placement and routing.
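Of the placement paradigms the survey lists, simulated annealing is the easiest to sketch: swap two cells, accept cost increases with probability exp(-dcost/T), and cool the temperature. The grid size, move set, and cooling schedule below are illustrative assumptions:

```python
import math
import random

def anneal_place(nets, num_cells, grid, iters=5000, seed=0):
    """Minimal simulated-annealing placer on a grid x grid array.
    Cost is total half-perimeter wirelength (HPWL) over all nets;
    nets are lists of cell ids, cells 0..num_cells-1 occupy distinct sites."""
    rng = random.Random(seed)
    sites = [(x, y) for x in range(grid) for y in range(grid)]
    rng.shuffle(sites)
    pos = {c: sites[c] for c in range(num_cells)}

    def hpwl():
        total = 0
        for net in nets:
            xs = [pos[c][0] for c in net]
            ys = [pos[c][1] for c in net]
            total += (max(xs) - min(xs)) + (max(ys) - min(ys))
        return total

    cost, temp = hpwl(), 10.0
    for _ in range(iters):
        a, b = rng.sample(range(num_cells), 2)
        pos[a], pos[b] = pos[b], pos[a]          # propose a swap
        new = hpwl()
        if new <= cost or rng.random() < math.exp((cost - new) / temp):
            cost = new                           # accept (possibly uphill)
        else:
            pos[a], pos[b] = pos[b], pos[a]      # revert
        temp *= 0.999                            # geometric cooling
    return pos, cost
```

A production FPGA placer layers heterogeneity (different site types per cell), timing cost terms, and range-limited moves on top of this skeleton.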
Citations: 17
AEP: An error-bearing neural network accelerator for energy efficiency and model protection
Pub Date : 2017-11-13 DOI: 10.1109/ICCAD.2017.8203854
Lei Zhao, Youtao Zhang, Jun Yang
Neural Networks (NNs) have recently gained popularity in a wide range of modern application domains due to their superior inference accuracy. With growing problem size and complexity, modern NNs, e.g., CNNs (Convolutional NNs) and DNNs (Deep NNs), contain a large number of weights, which require tremendous effort not only to prepare representative training datasets but also to train the network. There is an increasing demand to protect the NN weight matrices, an emerging form of Intellectual Property (IP) in the NN field. Unfortunately, adopting conventional encryption methods incurs significant performance and energy-consumption overheads. In this paper, we propose AEP, a DianNao-based NN accelerator design for IP protection. AEP aggressively reduces DRAM timing to generate a device-dependent error mask, i.e., a set of erroneous cells whose distribution is device dependent due to process variations. AEP incorporates the error mask into the NN training process so that the trained weights are device dependent, which effectively defeats IP piracy, as exporting the weights to other devices cannot produce satisfactory inference accuracy. In addition, AEP speeds up NN inference and achieves significant energy reduction because main memory dominates the energy consumption in the DianNao accelerator. Our evaluation results show that by injecting 0.1% to 5% memory errors, AEP incurs negligible inference accuracy loss on the target device while exhibiting unacceptable accuracy degradation on other devices. In addition, AEP achieves an average of 72% performance improvement and 44% energy reduction over the DianNao baseline.
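The core binding idea can be sketched in software. Here process variation is modeled by a per-device RNG seed and erroneous cells read back as zero; both are illustrative assumptions, and the training loop that adapts the weights to a given mask is omitted:

```python
import random

def device_error_mask(device_seed, num_weights, rate=0.02):
    """Model of a device-dependent error mask: aggressive DRAM timing
    leaves a device-specific subset of cells unreliable. The per-device
    RNG seed stands in for physical process variation."""
    rng = random.Random(device_seed)
    return {i for i in range(num_weights) if rng.random() < rate}

def read_weights(weights, mask):
    """Weights stored on an error-bearing device: masked cells read as 0,
    so training on the target device must learn around them."""
    return [0.0 if i in mask else w for i, w in enumerate(weights)]
```

Because two devices produce different masks, weights trained to tolerate one device's mask are read back through a different mask elsewhere, which is what degrades accuracy on pirated copies.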
Citations: 9
DAGSENS: Directed acyclic graph based direct and adjoint transient sensitivity analysis for event-driven objective functions
Pub Date : 2017-11-13 DOI: 10.1109/ICCAD.2017.8203773
K. Aadithya, E. Keiter, Ting Mei
We present DAGSENS, a new approach to parametric transient sensitivity analysis of Differential Algebraic Equation systems (DAEs), such as SPICE-level circuits. The key ideas behind DAGSENS are (1) to represent the entire sequence of computations from DAE parameters to the objective function (whose sensitivity is needed) as a Directed Acyclic Graph (DAG) called the "sensitivity DAG", and (2) to compute the required sensitivities efficiently by using dynamic programming techniques to traverse the DAG. DAGSENS is simple, elegant, and easy to understand compared to previous approaches; for example, in DAGSENS, one can switch between direct and adjoint sensitivities simply by reversing the direction of DAG traversal. DAGSENS is also more powerful than previous approaches because it works for a more general class of objective functions, including those based on "events" that occur during a transient simulation (e.g., a node voltage crossing a threshold, a phase-locked loop (PLL) achieving lock, a circuit signal reaching its maximum/minimum value, etc.). In this paper, we demonstrate DAGSENS on several electronic and biological applications, including high-speed communication, statistical cell library characterization, and gene expression.
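Switching between direct and adjoint sensitivities by reversing the traversal direction is the same move as reverse-mode differentiation on an expression DAG. The tiny sketch below (node encoding assumed; it is a generic adjoint sweep, not DAGSENS's DAE formulation) evaluates the DAG forward in topological order, then accumulates adjoints in one reverse pass:

```python
def eval_and_adjoint(nodes, params):
    """Forward pass evaluates a 'sensitivity DAG' in topological order;
    the reverse (adjoint) traversal accumulates d(output)/d(param) for
    every parameter in a single sweep. Node forms (assumed encoding):
      ('param', name) | ('add', i, j) | ('mul', i, j)"""
    val = []
    for op in nodes:                       # forward: topological order
        if op[0] == 'param':
            val.append(params[op[1]])
        elif op[0] == 'add':
            val.append(val[op[1]] + val[op[2]])
        else:                              # 'mul'
            val.append(val[op[1]] * val[op[2]])
    adj = [0.0] * len(nodes)
    adj[-1] = 1.0                          # seed the objective node
    for k in reversed(range(len(nodes))):  # adjoint: reversed traversal
        op = nodes[k]
        if op[0] == 'add':
            adj[op[1]] += adj[k]
            adj[op[2]] += adj[k]
        elif op[0] == 'mul':
            adj[op[1]] += adj[k] * val[op[2]]
            adj[op[2]] += adj[k] * val[op[1]]
    grads = {op[1]: adj[i] for i, op in enumerate(nodes) if op[0] == 'param'}
    return val[-1], grads
```

One reverse sweep yields sensitivities of one objective with respect to all parameters, which is exactly when the adjoint direction beats the direct (one-parameter-at-a-time) direction.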
{"title":"DAGSENS: Directed acyclic graph based direct and adjoint transient sensitivity analysis for event-driven objective functions","authors":"K. Aadithya, E. Keiter, Ting Mei","doi":"10.1109/ICCAD.2017.8203773","DOIUrl":"https://doi.org/10.1109/ICCAD.2017.8203773","url":null,"abstract":"We present DAGSENS, a new approach to parametric transient sensitivity analysis of Differential Algebraic Equation systems (DAEs), such as SPICE-level circuits. The key ideas behind DAGSENS are, (1) to represent the entire sequence of computations from DAE parameters to the objective function (whose sensitivity is needed) as a Directed Acyclic Graph (DAG) called the “sensitivity DAG”, and (2) to compute the required sensitivites efficiently by using dynamic programming techniques to traverse the DAG. DAGSENS is simple, elegant, and easy-to-understand compared to previous approaches; for example, in DAGSENS, one can switch between direct and adjoint sensitivities simply by reversing the direction of DAG traversal. Also, DAGSENS is more powerful than previous approaches because it works for a more general class of objective functions, including those based on “events” that occur during a transient simulation (e.g., a node voltage crossing a threshold, a phase-locked loop (PLL) achieving lock, a circuit signal reaching its maximum/minimum value, etc.). 
In this paper, we demonstrate DAGSENS on several electronic and biological applications, including high-speed communication, statistical cell library characterization, and gene expression.","PeriodicalId":126686,"journal":{"name":"2017 IEEE/ACM International Conference on Computer-Aided Design (ICCAD)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129470125","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
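The abstract's central claim — that direct and adjoint sensitivities differ only in the direction of DAG traversal — can be illustrated with a toy sensitivity DAG. The sketch below is a hypothetical stand-in for the paper's data structure, not its implementation: each node records its parent nodes and the local partial derivatives with respect to them, and both sensitivity modes are just traversals of the same graph.

```python
# Toy "sensitivity DAG": each node stores its parents and the local
# partials d(node)/d(parent). Direct sensitivities traverse the DAG
# forward; adjoint sensitivities reverse the traversal direction.

import math

class Node:
    def __init__(self, parents, partials, value):
        self.parents = parents    # parent nodes this node depends on
        self.partials = partials  # d(self)/d(parent), same order as parents
        self.value = value

def direct_sensitivity(order, source):
    """Forward (direct) mode: d(node)/d(source) for every node, in topological order."""
    sens = {source: 1.0}
    for node in order:
        if node is source:
            continue
        sens[node] = sum(p * sens[par] for par, p in zip(node.parents, node.partials))
    return sens

def adjoint_sensitivity(order, objective):
    """Reverse (adjoint) mode: d(objective)/d(node), traversing the same DAG backwards."""
    adj = {node: 0.0 for node in order}
    adj[objective] = 1.0
    for node in reversed(order):
        for par, p in zip(node.parents, node.partials):
            adj[par] += p * adj[node]
    return adj

# Objective f(x) = sin(x) * x at x = 2: both traversal directions
# must produce the same df/dx = sin(x) + x*cos(x).
x = Node([], [], 2.0)
s = Node([x], [math.cos(x.value)], math.sin(x.value))
f = Node([s, x], [x.value, s.value], s.value * x.value)
order = [x, s, f]

print(direct_sensitivity(order, x)[f])   # df/dx, forward traversal
print(adjoint_sensitivity(order, f)[x])  # same value, reverse traversal
```

Direct mode yields the sensitivity of every node to one parameter; adjoint mode yields the sensitivity of one objective to every node — which is why reversing the traversal is the natural way to switch between them when there are many parameters and few objectives.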
ORCHARD: Visual object recognition accelerator based on approximate in-memory processing
Pub Date : 2017-11-13 DOI: 10.1109/ICCAD.2017.8203756
Yeseong Kim, M. Imani, T. Simunic
In recent years, machine learning for visual object recognition has been applied to various domains, e.g., autonomous vehicles, health diagnosis, and home automation. However, the recognition procedures still consume a lot of processing energy and incur a high cost of data movement for memory accesses. In this paper, we propose a novel hardware accelerator design, called ORCHARD, which processes object recognition tasks inside memory. The proposed design accelerates both the image feature extraction and the boosting-based learning algorithm, which are key subtasks of state-of-the-art image recognition approaches. We optimize the recognition procedures by leveraging approximate computing and emerging non-volatile memory (NVM) technology. The NVM-based in-memory processing allows the proposed design to mitigate the CMOS-based computation overhead, greatly improving system efficiency. In our evaluation, conducted on circuit- and device-level simulations, we show that ORCHARD successfully performs practical image recognition tasks, including text, face, pedestrian, and vehicle recognition, with only 0.3% accuracy loss from computation approximation. In addition, our design significantly improves performance and energy efficiency by up to 376x and 1896x, respectively, compared to an existing processor-based implementation.
{"title":"ORCHARD: Visual object recognition accelerator based on approximate in-memory processing","authors":"Yeseong Kim, M. Imani, T. Simunic","doi":"10.1109/ICCAD.2017.8203756","DOIUrl":"https://doi.org/10.1109/ICCAD.2017.8203756","url":null,"abstract":"In recent years, machine learning for visual object recognition has been applied to various domains, e.g., autonomous vehicle, heath diagnose, and home automation. However, the recognition procedures still consume a lot of processing energy and incur a high cost of data movement for memory accesses. In this paper, we propose a novel hardware accelerator design, called ORCHARD, which processes the object recognition tasks inside memory. The proposed design accelerates both the image feature extraction and boosting-based learning algorithm, which are key subtasks of the state-of-the-art image recognition approaches. We optimize the recognition procedures by leveraging approximate computing and emerging non-volatile memory (NVM) technology. The NVM-based in-memory processing allows the proposed design to mitigate the CMOS-based computation overhead, highly improving the system efficiency. In our evaluation conducted on circuit- and device-level simulations, we show that ORCHARD successfully performs practical image recognition tasks, including text, face, pedestrian, and vehicle recognition with 0.3% of accuracy loss made by computation approximation. 
In addition, our design significantly improves the performance and energy efficiency by up to 376x and 1896x, respectively, compared to the existing processor-based implementation.","PeriodicalId":126686,"journal":{"name":"2017 IEEE/ACM International Conference on Computer-Aided Design (ICCAD)","volume":"373 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116364542","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 43
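ORCHARD's tolerance of approximation can be illustrated in software: a boosted ensemble of decision stumps — the kind of boosting-based learner the abstract mentions — still classifies a simple dataset correctly when its input feature is quantized to a few bits, as an approximate in-memory feature extractor might produce. The sketch below is purely illustrative (a plain-Python AdaBoost on a 1-D toy problem, not the ORCHARD hardware or its datasets):

```python
# Illustrative AdaBoost with threshold stumps, fed either exact or
# low-precision (quantized) features, to show that boosting tolerates
# coarse feature approximation on an easy, separable task.

import math
import random

def quantize(x, bits=4, lo=0.0, hi=1.0):
    """Approximate a feature value with a low-precision code, as cheap hardware might."""
    levels = (1 << bits) - 1
    code = round((x - lo) / (hi - lo) * levels)
    return lo + code / levels * (hi - lo)

def train_adaboost(xs, ys, rounds=5):
    """AdaBoost with threshold stumps on a 1-D feature; labels ys are in {-1, +1}."""
    n = len(xs)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        best = None
        for thr in sorted(set(xs)):          # try every sample value as a threshold
            for sign in (+1, -1):
                err = sum(wi for xi, yi, wi in zip(xs, ys, w)
                          if sign * (1 if xi > thr else -1) != yi)
                if best is None or err < best[0]:
                    best = (err, thr, sign)
        err, thr, sign = best
        err = max(err, 1e-10)                # avoid log(0) on a perfect stump
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, thr, sign))
        preds = [sign * (1 if xi > thr else -1) for xi in xs]
        w = [wi * math.exp(-alpha * yi * pi) for wi, yi, pi in zip(w, ys, preds)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    score = sum(alpha * sign * (1 if x > thr else -1) for alpha, thr, sign in ensemble)
    return 1 if score >= 0 else -1

random.seed(0)
xs = [random.random() for _ in range(200)]
ys = [1 if x > 0.5 else -1 for x in xs]
model = train_adaboost(xs, ys)
exact = sum(predict(model, x) == y for x, y in zip(xs, ys)) / len(xs)
approx = sum(predict(model, quantize(x)) == y for x, y in zip(xs, ys)) / len(xs)
print(exact, approx)  # quantized inputs lose little accuracy on this task
```

The 0.3% accuracy loss ORCHARD reports comes from the same effect at scale: the ensemble's weighted vote absorbs the small perturbations that approximation introduces into individual features.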