{"title":"Physically Tightly Coupled, Logically Loosely Coupled, Near-Memory BNN Accelerator (PTLL-BNN)","authors":"Yun-Chen Lo, Yu-Chun Kuo, Yun-Sheng Chang, Jian-Hao Huang, Jun-Shen Wu, Wen-Chien Ting, Tai-Hsing Wen, Ren-Shuo Liu","doi":"10.1109/ESSCIRC.2019.8902909","DOIUrl":null,"url":null,"abstract":"In this paper, a physically tightly coupled, logically loosely coupled, near-memory binary neural network accelerator (PTLL-BNN) is designed and fabricated. Both architecture-level and circuit-level optimizations are presented. From the perspective of processor architecture, the PTLL-BNN includes two new design choices. First, the proposed BNN accelerator is placed close to the SRAM of the embedded processors (i.e., physically tightly coupled and near-memory); thus, the extra SRAM cost that is incurred by the accelerator is as low as 0.5 KB. Second, the accelerator is a memory-mapped IO (MMIO) device (i.e., logically loosely coupled), so all embedded processors can be equipped with the proposed accelerator without the burden of changing their compilers and pipelines. From the circuit perspective, this work employs four techniques to optimize the power and costs of the accelerator. First, this design adopts a unified input-kernel-output memory instead of separate ones, which many previous works adopt. Second, the data layout that this work chooses increases the sequentiality of the SRAM accesses and reduces the buffer size of storing the intermediate values. Third, this work innovatively proposes to fuse the max-pooling, batch-normalization, and binarization layers of the BNNs to significantly reduce the hardware complexity. Finally, a novel methodology of generating the scheduler hardware of the accelerator is included. We fabricate the accelerator using the TSMC 180 nm technology. The chip measurement results reach 91 GOP/s on average (307 GOP/s at peak) at 200 MHz. The achieved GOP/s per million logic gates and GOP/s per KB SRAM are 2.6 to 237 times greater than that of previous works, respectively. We also realize an FPGA system to demonstrate the recognition of CIFAR-10/100 images using the fabricated accelerator.","PeriodicalId":402948,"journal":{"name":"ESSCIRC 2019 - IEEE 45th European Solid State Circuits Conference (ESSCIRC)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ESSCIRC 2019 - IEEE 45th European Solid State Circuits Conference (ESSCIRC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ESSCIRC.2019.8902909","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 4
Abstract
In this paper, a physically tightly coupled, logically loosely coupled, near-memory binary neural network accelerator (PTLL-BNN) is designed and fabricated. Both architecture-level and circuit-level optimizations are presented. From the perspective of processor architecture, the PTLL-BNN includes two new design choices. First, the proposed BNN accelerator is placed close to the SRAM of the embedded processors (i.e., physically tightly coupled and near-memory); thus, the extra SRAM cost that is incurred by the accelerator is as low as 0.5 KB. Second, the accelerator is a memory-mapped IO (MMIO) device (i.e., logically loosely coupled), so all embedded processors can be equipped with the proposed accelerator without the burden of changing their compilers and pipelines. From the circuit perspective, this work employs four techniques to optimize the power and costs of the accelerator. First, this design adopts a unified input-kernel-output memory instead of separate ones, which many previous works adopt. Second, the data layout that this work chooses increases the sequentiality of the SRAM accesses and reduces the buffer size of storing the intermediate values. Third, this work innovatively proposes to fuse the max-pooling, batch-normalization, and binarization layers of the BNNs to significantly reduce the hardware complexity. Finally, a novel methodology of generating the scheduler hardware of the accelerator is included. We fabricate the accelerator using the TSMC 180 nm technology. The chip measurement results reach 91 GOP/s on average (307 GOP/s at peak) at 200 MHz. The achieved GOP/s per million logic gates and GOP/s per KB SRAM are 2.6 to 237 times greater than that of previous works, respectively. We also realize an FPGA system to demonstrate the recognition of CIFAR-10/100 images using the fabricated accelerator.