A 22 nm Floating-Point ReRAM Compute-in-Memory Macro Using Residue-Shared ADC for AI Edge Device
Authors: Hung-Hsi Hsu; Tai-Hao Wen; Win-San Khwa; Wei-Hsing Huang; Zhao-En Ke; Yu-Hsiang Chin; Hua-Jin Wen; Yu-Chen Chang; Wei-Ting Hsu; Ashwin Sanjay Lele; Bo Zhang; Ping-Sheng Wu; Chung-Chuan Lo; Ren-Shuo Liu; Chih-Cheng Hsieh; Kea-Tiong Tang; Shih-Hsin Teng; Chung-Cheng Chou; Yu-Der Chih; Tsung-Yung Jonathan Chang; Meng-Fan Chang
Journal: IEEE Journal of Solid-State Circuits, vol. 60, no. 1, pp. 171-183 (Q1, Engineering, Electrical & Electronic)
DOI: 10.1109/JSSC.2024.3470211
Published: 2024-10-22
URL: https://ieeexplore.ieee.org/document/10726927/
Citations: 0
Abstract
Artificial intelligence (AI) edge devices increasingly require the enhanced accuracy of floating-point (FP) multiply-and-accumulate (MAC) operations as well as nonvolatile on-chip memory to minimize the movement of weight data in power-off mode. Designing non-volatile compute-in-memory (nvCIM) macros for FP operations imposes several challenges, including: 1) a tradeoff between inference accuracy and weight bit-width following pre-alignment; 2) long computing latency and high energy consumption; 3) large cell array current during computation; and 4) high multi-bit readout energy consumption. In this study, we devised four schemes to address these issues, including: 1) a kernel-wise weight pre-alignment (K-WPA); 2) a rescheduled multi-bit input compression (RS-MIC); 3) HRS-favored dual-sign-bit (HF-DSB); and 4) residue-shared analog-to-digital converter (RS-ADC). A 16 Mb resistive random access memory (ReRAM) nvCIM macro fabricated for FP operations using foundry-provided ReRAM (22 nm CMOS technology) achieved an efficiency of 34.2 TFLOPS/W under BF16-input, BF16-weight, and FP32-output and 31.4 TFLOPS/W under FP16-input, FP16-weight, and FP32-output.
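The accuracy/bit-width tradeoff the abstract attributes to pre-alignment can be illustrated with a small sketch. In FP compute-in-memory designs, pre-alignment generally means shifting the mantissas of a group of FP weights to a single shared exponent so the array can perform fixed-point MACs; kernel-wise pre-alignment (as in the paper's K-WPA) picks that shared exponent per kernel. The code below is a minimal, hypothetical illustration of this general idea, not the paper's circuit-level implementation; the function names, the 8-bit mantissa width, and the rounding scheme are assumptions for demonstration only.

```python
import math

def kernel_wise_prealign(weights, mantissa_bits=8):
    """Align a kernel's FP weights to the kernel's maximum exponent,
    producing one shared exponent plus signed fixed-point mantissas.
    Illustrative sketch only, not the paper's K-WPA circuit."""
    # math.frexp(x) returns (m, e) with x = m * 2**e and |m| in [0.5, 1)
    exps = [math.frexp(w)[1] if w != 0 else -126 for w in weights]
    shared_exp = max(exps)
    aligned = []
    for w in weights:
        # Scale each weight by the shared exponent; weights far below the
        # kernel maximum lose low-order mantissa bits here, which is the
        # accuracy vs. weight bit-width tradeoff noted in the abstract.
        scaled = w / (2.0 ** shared_exp)              # now in (-1, 1)
        aligned.append(round(scaled * (1 << (mantissa_bits - 1))))
    return shared_exp, aligned

def reconstruct(shared_exp, aligned, mantissa_bits=8):
    """Map fixed-point mantissas back to FP values for comparison."""
    scale = 2.0 ** shared_exp / (1 << (mantissa_bits - 1))
    return [q * scale for q in aligned]
```

Aligning per kernel rather than across the whole array keeps the exponent gaps small, so fewer mantissa bits are truncated for the same storage width; that is the intuition behind trading alignment granularity against inference accuracy.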
About the Journal
The IEEE Journal of Solid-State Circuits publishes papers each month in the broad area of solid-state circuits, with particular emphasis on transistor-level design of integrated circuits. It also covers topics such as circuit modeling, technology, systems design, layout, and testing that relate directly to IC design. Integrated circuits and VLSI are of principal interest; material related to discrete circuit design is seldom published. Experimental verification is strongly encouraged.