Physically Tightly Coupled, Logically Loosely Coupled, Near-Memory BNN Accelerator (PTLL-BNN)

ESSCIRC 2019 - IEEE 45th European Solid State Circuits Conference (ESSCIRC) | Pub Date: 2019-09-01 | DOI: 10.1109/ESSCIRC.2019.8902909

Yun-Chen Lo, Yu-Chun Kuo, Yun-Sheng Chang, Jian-Hao Huang, Jun-Shen Wu, Wen-Chien Ting, Tai-Hsing Wen, Ren-Shuo Liu


Citations: 4

Abstract

This paper presents the design and fabrication of a physically tightly coupled, logically loosely coupled, near-memory binary neural network accelerator (PTLL-BNN), with optimizations at both the architecture level and the circuit level. At the architecture level, the PTLL-BNN makes two new design choices. First, the accelerator is placed close to the SRAM of the embedded processor (i.e., physically tightly coupled and near-memory), so the extra SRAM cost incurred by the accelerator is as low as 0.5 KB. Second, the accelerator is a memory-mapped I/O (MMIO) device (i.e., logically loosely coupled), so embedded processors can be equipped with it without the burden of changing their compilers and pipelines. At the circuit level, this work employs four techniques to optimize the power and cost of the accelerator. First, the design adopts a unified input-kernel-output memory instead of the separate memories that many previous works adopt. Second, the chosen data layout makes the SRAM accesses more sequential and reduces the size of the buffer that stores intermediate values. Third, the max-pooling, batch-normalization, and binarization layers of the BNN are fused to significantly reduce hardware complexity. Finally, a novel methodology for generating the scheduler hardware of the accelerator is included. The accelerator is fabricated in TSMC 180 nm technology. Chip measurements reach 91 GOP/s on average (307 GOP/s at peak) at 200 MHz. The achieved GOP/s per million logic gates and GOP/s per KB of SRAM are 2.6 to 237 times greater than those of previous works, respectively. An FPGA system is also realized to demonstrate the recognition of CIFAR-10/100 images using the fabricated accelerator.
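The abstract does not say how the max-pooling, batch-normalization, and binarization layers are fused. The sketch below shows one widely used way such a fusion can work, offered only as an illustration and not as the authors' circuit: because batch normalization is an affine function and binarization only looks at the sign, the pair collapses into a comparison against a single threshold, and the max over a pooling window then becomes a logical OR (or AND) of per-element comparisons. The function names, the parameter names (`gamma`, `beta`, `mean`, `var`, `eps`), and the +1/-1 output encoding are assumptions made for this example.

```python
import math


def reference(window, gamma, beta, mean, var, eps=1e-5):
    """Unfused path: max-pool, then batch-normalize, then binarize to +1/-1."""
    pooled = max(window)
    normed = gamma * (pooled - mean) / math.sqrt(var + eps) + beta
    return 1 if normed >= 0 else -1


def fused(window, gamma, beta, mean, var, eps=1e-5):
    """Fused path (illustrative assumption): one threshold compare per element.

    Solving gamma * (x - mean) / sigma + beta >= 0 for x gives a threshold
    tau that can be computed once, offline, for each output channel.
    """
    if gamma == 0:
        raise ValueError("gamma must be non-zero for the threshold to exist")
    sigma = math.sqrt(var + eps)
    tau = mean - beta * sigma / gamma
    if gamma > 0:
        # max(window) >= tau holds iff at least one element is >= tau (OR).
        return 1 if any(x >= tau for x in window) else -1
    # A negative gamma flips the inequality:
    # max(window) <= tau holds iff every element is <= tau (AND).
    return 1 if all(x <= tau for x in window) else -1
```

Under these assumptions, `fused([3, -2, 0, 1], 0.5, -1.5, 1.0, 4.0, eps=0.0)` and the matching `reference(...)` call agree. In exact arithmetic the two paths are equivalent for any non-zero `gamma`; with floating-point parameters, inputs that fall exactly on the threshold may round differently.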

