
Memories - Materials, Devices, Circuits and Systems: Latest Articles

A subranging nonuniform sampling memristive neural network-based analog-to-digital converter
Pub Date : 2023-07-01 DOI: 10.1016/j.memori.2023.100038
Hao You, Amirali Amirsoleimani, Jianxiong Xu, Mostafa Rahimi Azghadi, Roman Genov

This work presents a novel 4-bit subranging nonuniform sampling (NUS) memristive neural network-based analog-to-digital converter (ADC) with an improved performance trade-off among speed, power, area, and accuracy. The proposed design preserves the memristive neural network calibration and uses a trainable memristor weight to adapt to device mismatch and increase accuracy. Rather than conventional binary searching, we adopt quaternary searching in the ADC to determine the coarse and fine bits of the subranging architecture. Level-crossing NUS is introduced to the proposed ADC to enhance the ENOB at the same resolution, power, and area consumption. Area and power consumption are reduced through circuit sharing between different stages of bit determination. The proposed 4-bit ADC achieves a highest ENOB of 5.96, and an ENOB of 5.6 at the cut-off frequency (128 MHz), with a power consumption of 0.515 mW and a figure of merit (FoM) of 82.95 fJ/conv.
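
As a rough illustration of the quaternary search and the reported figure of merit, the sketch below models an ideal 4-bit converter that resolves two bits per stage with three threshold comparisons. It does not model the memristive neural network, its training, or the nonuniform sampling. The function names, the 1 V reference, and the Walden-style FoM definition are assumptions; the abstract does not state which FoM formula is used, although the reported numbers are consistent with this form when ENOB = 5.6 and the 128 MHz figure is taken as the conversion rate.

```python
def quaternary_search_adc(vin, vref=1.0, bits=4):
    """Ideal subranging conversion: each stage resolves 2 bits.

    Assumption: ideal comparators and a unipolar input range [0, vref].
    """
    assert bits % 2 == 0, "quaternary search resolves bits in pairs"
    low, span, code = 0.0, vref, 0
    for _ in range(bits // 2):
        quarter = span / 4
        # Three comparisons place vin in one of four sub-ranges (digit 0..3).
        digit = sum(vin >= low + k * quarter for k in (1, 2, 3))
        code = code * 4 + digit      # append the two new bits
        low += digit * quarter       # narrow the search window
        span = quarter
    return code


def walden_fom(power_w, enob, f_hz):
    """Walden-style figure of merit in joules per conversion step (assumed form)."""
    return power_w / (2 ** enob * f_hz)


if __name__ == "__main__":
    print(quaternary_search_adc(0.5))                  # coarse digit 2, fine digit 0
    print(walden_fom(0.515e-3, 5.6, 128e6) * 1e15)     # in fJ/conv
```

With two stages, the first pass yields the coarse bits and the second the fine bits, so a 4-bit code needs two search steps instead of the four a binary search would take.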

Memories - Materials, Devices, Circuits and Systems, Volume 4, Article 100038
Citations: 4
Flux-charge analysis and experimental verification of a parallel Memristor–Capacitor circuit
Pub Date : 2023-07-01 DOI: 10.1016/j.memori.2023.100043
M.A. Carrasco-Aguilar, F.E. Morales-López, C. Sánchez-López, Rocio Ochoa-Montiel

In this article, the flux-charge analysis method is applied to obtain the theoretical response of the voltage generated in a parallel Memristor–Capacitor (M–C) circuit excited by an input pulse generator with a 100 kHz frequency, 5 V amplitude, and 50 Ω output impedance. The nonlinear ordinary differential equation that results from applying the method is solved numerically. A previously reported floating memristor emulator was used as the memristive circuit. The response obtained is compared with the experimental response, providing evidence that the applied analysis method yields an acceptable margin of error with regard to the experimental results. This contrasts with other similar reports, where the analyses are based on theoretical memristive models and show simulation results only. In summary, the paper contributes to the analysis and experimental verification of the parallel M–C circuit driven by a real switched excitation source, using a memristance equation established in an emulator that differs from the equations commonly used in the literature.
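
To make the flux-charge formulation concrete, here is a minimal sketch under stated assumptions, not the paper's actual model. For a source with series resistance Rs driving a memristor in parallel with a capacitor C, integrating Kirchhoff's laws from zero initial conditions gives φ_s − φ = Rs·(q_M(φ) + C·dφ/dt), where φ is the flux across the parallel pair and dφ/dt is its voltage. The cubic charge-flux curve, the 1 nF capacitance, and all function names are hypothetical; the emulator's memristance equation is not given in the abstract. The equation is integrated with a classical fourth-order Runge-Kutta step.

```python
def simulate_parallel_mc(v_source, q_of_phi, rs, c, t_end, dt):
    """Integrate dphi/dt = (phi_s - phi - rs*q_M(phi)) / (rs*c) with RK4.

    Returns (times, volts), where volts is the voltage across the M-C pair.
    Assumes zero initial flux and charge.
    """
    def rhs(t, phi_s, phi):
        v = (phi_s - phi - rs * q_of_phi(phi)) / (rs * c)
        return v_source(t), v        # d(phi_s)/dt, d(phi)/dt

    t = phi_s = phi = 0.0
    times, volts = [], []
    for _ in range(int(round(t_end / dt))):
        a1, b1 = rhs(t, phi_s, phi)
        a2, b2 = rhs(t + dt / 2, phi_s + dt / 2 * a1, phi + dt / 2 * b1)
        a3, b3 = rhs(t + dt / 2, phi_s + dt / 2 * a2, phi + dt / 2 * b2)
        a4, b4 = rhs(t + dt, phi_s + dt * a3, phi + dt * b3)
        phi_s += dt / 6 * (a1 + 2 * a2 + 2 * a3 + a4)
        phi += dt / 6 * (b1 + 2 * b2 + 2 * b3 + b4)
        t += dt
        times.append(t)
        volts.append(rhs(t, phi_s, phi)[1])
    return times, volts


def pulse(t, amplitude=5.0, freq=100e3):
    """5 V, 100 kHz square pulse train (50 % duty cycle is an assumption)."""
    return amplitude if (t * freq) % 1.0 < 0.5 else 0.0


def cubic_memristor(phi, a=1e-3, b=1e6):
    """Hypothetical flux-controlled charge curve q(phi) = a*phi + b*phi**3."""
    return a * phi + b * phi ** 3


if __name__ == "__main__":
    times, volts = simulate_parallel_mc(pulse, cubic_memristor,
                                        rs=50.0, c=1e-9, t_end=20e-6, dt=1e-9)
    print(max(volts))
```

Swapping `cubic_memristor` for a different charge-flux function is the only change needed to try another memristor model; a measured emulator characteristic could be substituted in the same way.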

Memories - Materials, Devices, Circuits and Systems, Volume 4, Article 100043
Citations: 1
A review on computational storage devices and near memory computing for high performance applications
Pub Date : 2023-07-01 DOI: 10.1016/j.memori.2023.100051
Dina Fakhry, Mohamed Abdelsalam, M. Watheq El-Kharashi, Mona Safar

The von Neumann bottleneck arises from the explosion of data transfers and emerging data-intensive applications in heterogeneous system architectures. The conventional approach of transferring data to the CPU for computation is no longer suitable, especially given the cost it imposes. With storage capacities increasing, moving extensive data volumes between storage and computation cannot scale. Hence, high-performance data processing mechanisms are needed, which may be achieved by bringing computation closer to data. Gathering insights where data is stored helps with energy efficiency, latency, and security. Storage bus bandwidth is also saved when only computation results are delivered to the host memory. Various applications, including database acceleration, machine learning, Artificial Intelligence (AI), offloading (compression/encryption/encoding), and others, can perform better and become more scalable if the “move process to data” paradigm is applied. Embedding processing engines inside Solid-State Drives (SSDs), transforming them into Computational Storage Devices (CSDs), provides the needed data processing solution. In this paper, we review the prior art on Near Data Processing (NDP) with a focus on In-Storage Computing (ISC), identifying main challenges and potential gaps for future research directions.

Memories - Materials, Devices, Circuits and Systems, Volume 4, Article 100051
Citations: 2
Realization of multi-mode universal shadow filter and its application as a frequency-hopping filter
Pub Date : 2023-07-01 DOI: 10.1016/j.memori.2023.100049
Divya Singh, Sajal K. Paul

This work presents a new active block called the differential current conveyor cascaded transconductance amplifier (DCCCTA) and implements a multi-mode biquadratic universal shadow filter. A frequency-hopping filter is then implemented using the multi-mode universal shadow filter. The proposed circuit has two modes of operation: current mode (CM) and transadmittance mode (TAM). All-pass (AP), band-pass (BP), band-reject (BR), high-pass (HP), and low-pass (LP) responses are obtained simultaneously. As intended, the input impedance is low for CM and high for TAM, while the output impedance is high for both modes of operation. Inter-modulation distortion (IMD), percentage total harmonic distortion (%THD), and Monte Carlo analysis results are also reported. The theoretical results are verified using Cadence Virtuoso in 180 nm TSMC technology.

Memories - Materials, Devices, Circuits and Systems, Volume 4, Article 100049
Citations: 0
Reconfigurable optoelectronic absorber based on nested nano disk-ribbon graphene Pattern in THz range
Pub Date : 2023-07-01 DOI: 10.1016/j.memori.2023.100039
Ilghar Rezaei, Ava Salmanpour, Toktam Aghaee

A tunable two-layer, multi-band super absorber is proposed in this paper. The idea behind the design is to realize periodic arrays of graphene disks via graphene ribbons with different lengths. Circuit modeling is then developed and used alongside the impedance matching concept to achieve more than ten absorption peaks. The spacer is a lossless polymer in the THz frequency range, while the bottom of the device is a relatively thick gold plate. The developed circuit model is verified by full-wave simulation. According to the simulation results, the proposed absorber shows more than ten peaks with absorption over 90%. The peak frequencies can be shifted by varying a single chemical potential. Additionally, deviations of the absorber response with graphene electron relaxation time and device geometry are shown to be marginal, which makes the presented meta-absorber a reliable optical device.
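
The impedance matching idea can be illustrated with a short sketch. It assumes normal incidence from free space and a back plate thick enough that transmission is negligible, so absorption is one minus the reflected power. The function name and the example impedance values are illustrative and are not taken from the paper's circuit model.

```python
def absorption(z_in, z0=376.730313668):
    """Absorption of a metal-backed absorber from its input impedance.

    Assumptions: normal incidence, zero transmission through the back plate,
    z0 is the free-space wave impedance in ohms.
    """
    gamma = (z_in - z0) / (z_in + z0)    # reflection coefficient
    return 1.0 - abs(gamma) ** 2


if __name__ == "__main__":
    # A perfectly matched structure absorbs everything; a reactive
    # mismatch (illustrative value) lowers the absorption.
    print(absorption(376.730313668))
    print(absorption(complex(376.730313668, 150.0)))
```

An absorption peak occurs wherever the input impedance of the structure comes close to the free-space impedance, which is why a circuit model of the graphene layers can be used to place the peaks.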

Memories - Materials, Devices, Circuits and Systems, Volume 4, Article 100039
Citations: 5
Improvement of memory performance of 3-D NAND flash memory with retrograde channel doping
Pub Date : 2023-07-01 DOI: 10.1016/j.memori.2023.100031
Deepika Gupta, Abhishek Kumar Upadhyay, Ankur Beohar, Santosh Kumar Vishvakarma

This paper examines the effect of retrograde channel doping on the reliability and performance of 3-D junction-free NAND flash memory. Specifically, we study the program characteristics and data retention capability of junction-free NAND flash memory with half pitch ranging from 35 nm to 12 nm. Based on our analysis, we highlight that the retrograde channel doping approach can improve not only the short-channel effects (SCEs) but also the program speed and data control time for 3-D junction-free NAND flash memory, without varying the oxide stack in charge trap-based flash memory.

Memories - Materials, Devices, Circuits and Systems, Volume 4, Article 100031
Citations: 0
FourierPIM: High-throughput in-memory Fast Fourier Transform and polynomial multiplication
Pub Date : 2023-07-01 DOI: 10.1016/j.memori.2023.100034
Orian Leitersdorf, Yahav Boneh, Gonen Gazit, Ronny Ronen, Shahar Kvatinsky

The Discrete Fourier Transform (DFT) is essential for various applications ranging from signal processing to convolution and polynomial multiplication. The groundbreaking Fast Fourier Transform (FFT) algorithm reduces DFT time complexity from the naive O(n^2) to O(n log n), and recent works have sought further acceleration through parallel architectures such as GPUs. Unfortunately, accelerators such as GPUs cannot exploit their full computing capabilities since memory access becomes the bottleneck. Therefore, this paper accelerates the FFT algorithm using digital Processing-in-Memory (PIM) architectures that shift computation into the memory by exploiting physical devices capable of both storage and logic (e.g., memristors). We propose an O(log n) in-memory FFT algorithm that can also be performed in parallel across multiple arrays for high-throughput batched execution, supporting both fixed-point and floating-point numbers. Through the convolution theorem, we extend this algorithm to O(log n) polynomial multiplication, a fundamental task for applications such as cryptography. We evaluate FourierPIM on a publicly available cycle-accurate simulator that verifies both correctness and performance, and demonstrate 5–15× throughput and 4–13× energy improvement over the NVIDIA cuFFT library on state-of-the-art GPUs for FFT and polynomial multiplication.
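
For readers unfamiliar with how the convolution theorem turns polynomial multiplication into FFTs, the following is a plain sequential sketch. It is the textbook O(n log n) radix-2 algorithm running on a CPU, not the O(log n) in-memory algorithm the paper proposes; the function names are illustrative.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 Cooley-Tukey FFT; len(a) must be a power of two.

    With invert=True the conjugate twiddles are used and the caller must
    divide the result by len(a).
    """
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)   # twiddle factor
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out


def poly_mul(p, q):
    """Multiply integer-coefficient polynomials (lowest degree first)."""
    size = len(p) + len(q) - 1
    n = 1
    while n < size:                # pad to a power of two
        n *= 2
    fp = fft([complex(x) for x in p] + [0j] * (n - len(p)))
    fq = fft([complex(x) for x in q] + [0j] * (n - len(q)))
    # Pointwise product in the frequency domain equals convolution in time.
    prod = fft([x * y for x, y in zip(fp, fq)], invert=True)
    return [round((v / n).real) for v in prod[:size]]


if __name__ == "__main__":
    # (1 + 2x + 3x^2) * (4 + 5x)
    print(poly_mul([1, 2, 3], [4, 5]))
```

The three transforms dominate the cost, which is why accelerating the FFT itself also accelerates polynomial multiplication.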

Memories - Materials, Devices, Circuits and Systems, Volume 4, Article 100034
Citations: 2
A survey on processing-in-memory techniques: Advances and challenges
Pub Date : 2023-07-01 DOI: 10.1016/j.memori.2022.100022
Kazi Asifuzzaman, Narasinga Rao Miniskar, Aaron R. Young, Frank Liu, Jeffrey S. Vetter

Processing-in-memory (PIM) techniques have gained much attention from computer architecture researchers, and significant research effort has been invested in exploring and developing such techniques. Increasing the research activity dedicated to improving PIM techniques will hopefully help deliver PIM’s promise to solve or significantly reduce memory access bottleneck problems for memory-intensive applications. We also believe it is imperative to track the advances made in PIM research to identify open challenges and enable the research community to make informed decisions and adjust future research directions. In this survey, we analyze recent studies that explored PIM techniques, summarize the advances made, compare recent PIM architectures, and identify target application domains and suitable memory technologies. We also discuss proposals that address unresolved issues of PIM designs (e.g., address translation/mapping of operands, workload analysis to identify application segments that can be accelerated with PIM, OS/runtime support, and coherency issues that must be resolved to incorporate PIM). We believe this work can serve as a useful reference for researchers exploring PIM techniques.

Memories - Materials, Devices, Circuits and Systems, Volume 4, Article 100022
Citations: 10
Voltage Reduced Self Refresh (VRSR) for optimized energy savings in DRAM Memories
Pub Date : 2023-07-01 DOI: 10.1016/j.memori.2023.100058
Diyanesh Chinnakkonda, Venkata Kalyan Tavva, M.B. Srinivas

Modern computing systems demand DRAMs with more capacity and bandwidth to keep pace with the onslaught of new data-intensive applications. Though DRAM scaling offers higher density devices to realize high memory capacity systems, energy consumption has become a key design limiter. This is because the memory sub-system continues to be responsible for a significant fraction of overall system energy. Self-refresh mode is the low power state that consumes the least DRAM energy, and it is an essential operation to avoid data loss. However, self-refresh energy also continues to grow with density scaling. This paper carries out a detailed study of reducing self-refresh energy by reducing the supply voltage. PARSEC benchmarks in Gem5 full-system mode are used to quantify the merit of self-refresh energy savings at reduced voltages for normal, reduced, and extended temperature ranges. The latency impacts of the basic operations involved in self-refresh are evaluated using the 16 nm SPICE model. Possible limitations in extending the work to real hardware are also discussed. To motivate future implementation, DRAM architectural changes, additional low power states, and an entry/exit flow to exercise reduced voltage operation in self-refresh mode are proposed. We present this new low power mode as Voltage Reduced Self-Refresh (VRSR) operation. Our simulation results show a maximum of ~12.4% and an average of ~4% workload energy savings, with less than 0.7% performance loss across all benchmarks, for an aggressive voltage reduction of 150 mV.

Citation: Diyanesh Chinnakkonda, Venkata Kalyan Tavva, M.B. Srinivas. "Voltage Reduced Self Refresh (VRSR) for optimized energy savings in DRAM Memories." Memories - Materials, Devices, Circuits and Systems, vol. 4, Article 100058, 2023-07-01. https://doi.org/10.1016/j.memori.2023.100058
Citations: 0
Efficient grouping approach for fault tolerant weight mapping in memristive crossbar array
Pub Date : 2023-07-01 DOI: 10.1016/j.memori.2023.100045
Dev Narayan Yadav, Phrangboklang Lyngton Thangkhiew, Sandip Chakraborty, Indranil Sengupta

The ability of resistive memory (ReRAM) to naturally perform vector–matrix multiplication (VMM), the primary operation carried out during the training and inference of neural networks, has caught the interest of researchers. The memristor crossbar is one of the desirable architectures for VMM because it offers various benefits over other memory technologies, including in-memory computing, low power, and high density. Direct downloading and chip-on-the-loop approaches are typically used to train ReRAM-based neural networks. In these methods, all weight computations are carried out by a host machine, and the computed weights are then downloaded to the crossbar. Once the weights have been downloaded, however, the network does not deliver the same precision as promised by the host system. This is because crossbars contain a significant number of faulty memristors and suffer from cell resistance variations caused by immature manufacturing technologies. As a result, a cell may not be able to take the exact weight value that the host system generates, which may lead to incorrect inferences. Existing techniques for fault-tolerant mapping either involve network retraining or employ a graph-matching strategy that comes with hardware, power, and latency overheads. In this paper, we propose a mapping method to tolerate the effect of defective memristors. To lessen their impact, the mapping is done in a way that allows network weights to cover up faulty memristors. Further, this work prioritizes the different faults based on their frequency of occurrence. The proposed approach is found to increase mapping efficiency significantly with low power, area, and latency overheads. Experimental analyses show considerable improvement compared to state-of-the-art works.
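
The idea of letting weights "cover up" faulty memristors can be illustrated with a small sketch. This is not the grouping algorithm from the paper, which the abstract does not spell out. It is a simple greedy row assignment under stated assumptions: weights are normalised to [0, 1], a stuck-at-0 cell always reads 0.0, a stuck-at-1 cell always reads 1.0, and the fault locations are known in advance. The names `row_cost`, `greedy_row_mapping`, `total_error`, and the fault-map encoding are all hypothetical.

```python
from typing import Dict, List, Tuple

# Assumed fault encoding: (physical_row, column) -> value the stuck cell always reads.
SA0, SA1 = 0.0, 1.0  # stuck-at-0 (low conductance) and stuck-at-1 (high conductance)
FaultMap = Dict[Tuple[int, int], float]


def row_cost(weight_row: List[float], phys_row: int, faults: FaultMap) -> float:
    """Absolute weight error incurred if weight_row is stored in phys_row."""
    return sum(
        abs(w - faults[(phys_row, col)])
        for col, w in enumerate(weight_row)
        if (phys_row, col) in faults
    )


def greedy_row_mapping(weights: List[List[float]], faults: FaultMap) -> List[int]:
    """Return mapping[i] = physical crossbar row chosen for logical weight row i.

    Each logical row takes the still-free physical row whose stuck cells
    disturb it least. Greedy, so not guaranteed to be optimal.
    """
    free = set(range(len(weights)))
    mapping = []
    for weight_row in weights:
        # sorted() makes tie-breaking deterministic (lowest row index wins).
        best = min(sorted(free), key=lambda r: row_cost(weight_row, r, faults))
        mapping.append(best)
        free.remove(best)
    return mapping


def total_error(weights: List[List[float]], mapping: List[int], faults: FaultMap) -> float:
    """Summed absolute weight error over all rows for a given mapping."""
    return sum(row_cost(row, mapping[i], faults) for i, row in enumerate(weights))


if __name__ == "__main__":
    weights = [
        [0.9, 0.1, 0.5],
        [0.0, 1.0, 0.4],
        [0.5, 0.5, 0.0],
    ]
    faults = {(0, 0): SA0, (1, 2): SA0, (2, 1): SA1}

    identity = list(range(len(weights)))
    mapping = greedy_row_mapping(weights, faults)
    print("mapping:", mapping)
    print("error with identity mapping:", total_error(weights, identity, faults))
    print("error with greedy mapping:  ", total_error(weights, mapping, faults))
```

In this toy case the second weight row starts with 0.0, so placing it on the physical row whose first cell is stuck at 0 costs nothing. A permuted mapping also requires the crossbar inputs to be routed in the same permuted order, which the sketch does not model.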

Citation: Dev Narayan Yadav, Phrangboklang Lyngton Thangkhiew, Sandip Chakraborty, Indranil Sengupta. "Efficient grouping approach for fault tolerant weight mapping in memristive crossbar array." Memories - Materials, Devices, Circuits and Systems, vol. 4, Article 100045, 2023-07-01. https://doi.org/10.1016/j.memori.2023.100045
Citations: 0