An 800-MHz 8.17-TOPS/W 0.63-TOPS/mm$^2$ Memory-Utilization-Aware CNN Accelerator Featuring a Memory Stationary Dataflow
Jixuan Li; Ke Li; Ka-Fai Un; Wei-Han Yu; Rui P. Martins; Pui-In Mak
IEEE Journal of Solid-State Circuits, vol. 60, no. 8, pp. 3033-3042. Published 2025-02-05. DOI: 10.1109/JSSC.2025.3532544
https://ieeexplore.ieee.org/document/10873361/
Journal impact factor: 5.6 (JCR Q1, Engineering, Electrical & Electronic)
Citations: 0
Abstract
Increasing the on-chip memory utilization (OCMU) is crucial for an area-efficient deep neural network accelerator. We propose a memory stationary (MS) dataflow to ingeniously combine the input and output features in a single memory block in a cyclic manner, significantly increasing the OCMU. The MS dataflow also reduces the feature memory access by 78.0%. Furthermore, residual paths in a ResNet model require large feature buffering. We introduce layer-wise clipped-asymmetric residual distillation (LCARD) quantization, which removes the residual paths with minimal accuracy degradation. It dynamically assigns different feature/weight bit-widths to different layers, further enhancing the OCMU by $3.2\times$ and the throughput by $4.5\times$ compared with a fixed bit-width (FBW) approach. We also present MS gating (MSG), which skips ineffective channels and improves the OCMU by $1.2\times$ and the throughput by $1.3\times$. Fabricated in a 28-nm CMOS process, the proposed accelerator exhibits an 8.17-TOPS/W peak energy efficiency and a 0.63-TOPS/mm$^2$ peak area efficiency at 800 MHz and 0.9 V while requiring only 120 kB of on-chip static random-access memory (SRAM) for ResNet-50.
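As a rough illustration of the general idea of keeping input and output features in one memory block, the toy Python sketch below runs a 1-D convolution inside a single circular buffer, writing each output into the slot of an input that is no longer needed. This is only an assumed, simplified model: the abstract does not give the addressing scheme of the MS dataflow, and every name here (`conv1d_single_buffer`, `load`, `read`, `conv1d_reference`) is hypothetical.

```python
def load(buf, start, values):
    """Write values into the circular buffer beginning at index start."""
    cap = len(buf)
    for i, v in enumerate(values):
        buf[(start + i) % cap] = v


def read(buf, start, n):
    """Read n consecutive features from the circular buffer."""
    cap = len(buf)
    return [buf[(start + i) % cap] for i in range(n)]


def conv1d_single_buffer(buf, n_in, start, kernel):
    """Valid, stride-1 1-D convolution computed in place.

    Output i depends only on inputs i .. i+k-1, so once it is computed
    input i is dead and its slot can hold output i. No second feature
    buffer is needed (assumed toy model, not the paper's scheme).
    Returns (start index of the outputs, number of outputs).
    """
    cap = len(buf)
    k = len(kernel)
    n_out = n_in - k + 1
    for i in range(n_out):
        acc = sum(kernel[j] * buf[(start + i + j) % cap] for j in range(k))
        buf[(start + i) % cap] = acc  # overwrite the consumed input
    return start, n_out


def conv1d_reference(x, kernel):
    """Same convolution with separate input and output lists."""
    k = len(kernel)
    return [sum(kernel[j] * x[i + j] for j in range(k))
            for i in range(len(x) - k + 1)]


# Two stacked layers sharing one 8-entry buffer; the data wraps around
# the end of the buffer because it starts at index 5.
features = [1, 2, 3, 4, 5, 6, 7, 8]
buf = [0] * 8
load(buf, 5, features)

start, n = conv1d_single_buffer(buf, len(features), 5, [1, 0, -1])
layer1 = read(buf, start, n)
start, n = conv1d_single_buffer(buf, n, start, [1, 1])
layer2 = read(buf, start, n)

assert layer1 == conv1d_reference(features, [1, 0, -1])
assert layer2 == conv1d_reference(layer1, [1, 1])
```

In this toy model the shared buffer holds 8 entries, whereas keeping separate input and output buffers for the first layer would need 8 + 6 = 14. The real accelerator handles multi-channel 2-D features and variable bit-widths, which this sketch does not attempt to model.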
About the journal:
The IEEE Journal of Solid-State Circuits publishes papers each month in the broad area of solid-state circuits with particular emphasis on transistor-level design of integrated circuits. It also provides coverage of topics such as circuits modeling, technology, systems design, layout, and testing that relate directly to IC design. Integrated circuits and VLSI are of principal interest; material related to discrete circuit design is seldom published. Experimental verification is strongly encouraged.