A real-time and energy-efficient SRAM with mixed-signal in-memory computing near CMOS sensors

IF 3; Zone 4, Computer Science; Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE; Journal of Real-Time Image Processing; Pub Date: 2024-07-31; DOI: 10.1007/s11554-024-01520-x

Jose-Angel Diaz-Madrid, Gines Domenech-Asensi, Ramon Ruiz-Merino, Juan-Francisco Zapata-Perez


Citations: 0

Abstract

In-memory computing (IMC) is a promising approach to reducing the latency and improving the energy efficiency of the operations needed to compute image convolutions. This study proposes a fully differential current-mode architecture that computes image convolutions in all four quadrants, intended for deep learning applications in CMOS imagers that place IMC near the CMOS sensor. The architecture processes the analog signals provided by the CMOS sensor without analog-to-digital conversion. It also eliminates data transfer between memory and analog operators, because the convolutions are computed inside a modified SRAM. The paper modifies the structure of a CMOS SRAM cell by adding transistors that multiply a binary weight (−1 or +1) by an analog signal. Modified SRAM cells can be interconnected to sum the multiplication results of the individual cells. Input currents can be connected to different SRAM cells, which makes the calculation highly scalable and parallel. For this study, a configurable module of nine modified SRAM cells with peripheral circuitry was designed to compute the convolution at each pixel of an image using a 3 × 3 mask with binary values (−1 or +1). An IMC module was then designed to perform 16 convolution operations in parallel, with the input currents shared among the 16 modules. This configuration computes 16 convolutions simultaneously, processing one column per cycle. A digital control circuit manages the readout or storage of the digital weights, as well as the multiply-and-add operations, in real time. The architecture was tested by convolving 3 × 3 binary masks with 32 × 32 pixel images to assess accuracy and scalability when two IMC modules are vertically integrated. Convolution weights are stored locally as 1-bit digital values. The circuit was synthesized in 180 nm CMOS technology, and simulation results indicate that it can perform a complete convolution in 3.2 ms, achieving an efficiency of 11,522 1-b TOPS/W (1-bit tera-operations per second per watt) with a similarity to ideal processing of 96%.
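The arithmetic the abstract describes can be illustrated in software. The following is a minimal behavioural sketch, not the authors' circuit or code: it assumes the sliding-window (unflipped, cross-correlation) form of convolution common in deep learning, "valid" border handling, and random test data. The function names `binary_conv3x3` and `imc_module` are hypothetical. It shows that multiplying by a ±1 weight only keeps or flips the sign of the input, that nine such products are summed per output pixel, and that 16 masks share the same input while outputs are produced one column at a time.

```python
import random

def binary_conv3x3(image, mask):
    """Valid-mode 3x3 sliding-window sum of `image` weighted by a +1/-1 mask."""
    rows, cols = len(image), len(image[0])
    out = [[0.0] * (cols - 2) for _ in range(rows - 2)]
    for c in range(cols - 2):            # one output column per step (assumed ordering)
        for r in range(rows - 2):
            acc = 0.0
            for i in range(3):
                for j in range(3):
                    x = image[r + i][c + j]
                    # A binary weight only keeps or inverts the sign of the input.
                    acc += x if mask[i][j] > 0 else -x
            out[r][c] = acc              # sum of the nine cell products
    return out

def imc_module(image, masks):
    """Apply every mask to the same shared input, mimicking parallel units."""
    return [binary_conv3x3(image, m) for m in masks]

random.seed(0)
# Hypothetical test data: a 32 x 32 image and 16 random binary masks.
image = [[random.random() for _ in range(32)] for _ in range(32)]
masks = [[[random.choice((-1, 1)) for _ in range(3)] for _ in range(3)]
         for _ in range(16)]

outputs = imc_module(image, masks)
print(len(outputs), len(outputs[0]), len(outputs[0][0]))  # 16 30 30
```

With "valid" borders each 32 × 32 input yields a 30 × 30 output per mask. How the actual design handles image borders and quantifies similarity to ideal processing is not stated in the abstract, so neither is modelled here.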



Source journal

Journal of Real-Time Image Processing (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE; ENGINEERING, ELECTRICAL & ELECTRONIC)

CiteScore: 6.80

Self-citation rate: 6.70%

Review time: 6 months

Journal description: Due to rapid advancements in integrated circuit technology, the rich theoretical results that have been developed by the image and video processing research community are now being increasingly applied in practical systems to solve real-world image and video processing problems. Such systems involve constraints placed not only on their size, cost, and power consumption, but also on the timeliness of the image data processed. Examples of such systems are mobile phones, digital still/video/cell-phone cameras, portable media players, personal digital assistants, high-definition television, video surveillance systems, industrial visual inspection systems, medical imaging devices, vision-guided autonomous robots, spectral imaging systems, and many other real-time embedded systems. In these real-time systems, strict timing requirements demand that results are available within a certain interval of time as imposed by the application. It is often the case that an image processing algorithm is developed and proven theoretically sound, presumably with a specific application in mind, but its practical applications and the detailed steps, methodology, and trade-off analysis required to achieve its real-time performance are not fully explored, leaving these critical and usually non-trivial issues for those wishing to employ the algorithm in a real-time system. The Journal of Real-Time Image Processing is intended to bridge the gap between the theory and practice of image processing, serving the greater community of researchers, practicing engineers, and industrial professionals who deal with designing, implementing or utilizing image processing systems which must satisfy real-time design constraints.