A 22 nm Floating-Point ReRAM Compute-in-Memory Macro Using Residue-Shared ADC for AI Edge Device
Authors: Hung-Hsi Hsu; Tai-Hao Wen; Win-San Khwa; Wei-Hsing Huang; Zhao-En Ke; Yu-Hsiang Chin; Hua-Jin Wen; Yu-Chen Chang; Wei-Ting Hsu; Ashwin Sanjay Lele; Bo Zhang; Ping-Sheng Wu; Chung-Chuan Lo; Ren-Shuo Liu; Chih-Cheng Hsieh; Kea-Tiong Tang; Shih-Hsin Teng; Chung-Cheng Chou; Yu-Der Chih; Tsung-Yung Jonathan Chang; Meng-Fan Chang
Journal: IEEE Journal of Solid-State Circuits, vol. 60, no. 1, pp. 171-183 (Q1, Engineering, Electrical & Electronic)
DOI: 10.1109/JSSC.2024.3470211
Published: 2024-10-22
URL: https://ieeexplore.ieee.org/document/10726927/
Citations: 0
Abstract
Artificial intelligence (AI) edge devices increasingly require the enhanced accuracy of floating-point (FP) multiply-and-accumulate (MAC) operations as well as nonvolatile on-chip memory to minimize the movement of weight data in power-off mode. Designing non-volatile compute-in-memory (nvCIM) macros for FP operations imposes several challenges, including: 1) a tradeoff between inference accuracy and weight bit-width following pre-alignment; 2) long computing latency and high energy consumption; 3) large cell array current during computation; and 4) high multi-bit readout energy consumption. In this study, we devised four schemes to address these issues, including: 1) a kernel-wise weight pre-alignment (K-WPA); 2) a rescheduled multi-bit input compression (RS-MIC); 3) HRS-favored dual-sign-bit (HF-DSB); and 4) residue-shared analog-to-digital converter (RS-ADC). A 16 Mb resistive random access memory (ReRAM) nvCIM macro fabricated for FP operations using foundry-provided ReRAM (22 nm CMOS technology) achieved an efficiency of 34.2 TFLOPS/W under BF16-input, BF16-weight, and FP32-output and 31.4 TFLOPS/W under FP16-input, FP16-weight, and FP32-output.
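The accuracy/bit-width tradeoff the abstract attributes to pre-alignment can be illustrated with a small sketch. In FP compute-in-memory designs, pre-alignment generally means shifting the mantissas of a group of FP weights to a single shared exponent so the array can perform fixed-point MACs; kernel-wise pre-alignment (as in the paper's K-WPA) picks that shared exponent per kernel. The code below is a minimal, hypothetical illustration of this general idea, not the paper's circuit-level implementation; the function names, the 8-bit mantissa width, and the rounding scheme are assumptions for demonstration only.

```python
import math

def kernel_wise_prealign(weights, mantissa_bits=8):
    """Align a kernel's FP weights to the kernel's maximum exponent,
    producing one shared exponent plus signed fixed-point mantissas.
    Illustrative sketch only, not the paper's K-WPA circuit."""
    # math.frexp(x) returns (m, e) with x = m * 2**e and |m| in [0.5, 1)
    exps = [math.frexp(w)[1] if w != 0 else -126 for w in weights]
    shared_exp = max(exps)
    aligned = []
    for w in weights:
        # Scale each weight by the shared exponent; weights far below the
        # kernel maximum lose low-order mantissa bits here, which is the
        # accuracy vs. weight bit-width tradeoff noted in the abstract.
        scaled = w / (2.0 ** shared_exp)              # now in (-1, 1)
        aligned.append(round(scaled * (1 << (mantissa_bits - 1))))
    return shared_exp, aligned

def reconstruct(shared_exp, aligned, mantissa_bits=8):
    """Map fixed-point mantissas back to FP values for comparison."""
    scale = 2.0 ** shared_exp / (1 << (mantissa_bits - 1))
    return [q * scale for q in aligned]
```

Aligning per kernel rather than across the whole array keeps the exponent gaps small, so fewer mantissa bits are truncated for the same storage width; that is the intuition behind trading alignment granularity against inference accuracy.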
About the Journal
The IEEE Journal of Solid-State Circuits publishes papers each month in the broad area of solid-state circuits, with particular emphasis on transistor-level design of integrated circuits. It also covers topics such as circuit modeling, technology, systems design, layout, and testing that relate directly to IC design. Integrated circuits and VLSI are of principal interest; material related to discrete circuit design is seldom published. Experimental verification is strongly encouraged.