Pub Date: 2023-07-01 | DOI: 10.1016/j.memori.2023.100038
A subranging nonuniform sampling memristive neural network-based analog-to-digital converter
Hao You, Amirali Amirsoleimani, Jianxiong Xu, Mostafa Rahimi Azghadi, Roman Genov
This work presents a novel 4-bit subranging nonuniform sampling (NUS) memristive neural network-based analog-to-digital converter (ADC) with an improved performance trade-off among speed, power, area, and accuracy. The proposed design preserves the memristive neural network calibration and utilizes trainable memristor weights to adapt to device mismatch and increase accuracy. Rather than conventional binary searching, quaternary searching is adopted in the ADC to realize the coarse- and fine-bit determination of the subranging architecture. A level-crossing NUS scheme is introduced to the proposed ADC to enhance the effective number of bits (ENOB) at the same resolution, power, and area consumption. Area and power consumption are reduced through circuit sharing between the different stages of bit determination. The proposed 4-bit ADC achieves a highest ENOB of 5.96, and an ENOB of 5.6 at the cut-off frequency (128 MHz), with a power consumption of 0.515 mW and a figure of merit (FoM) of 82.95 fJ/conv.
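These headline numbers are mutually consistent under the standard Walden figure-of-merit definition, FoM = P / (2^ENOB * f_s). The short check below is an illustrative sketch of that textbook formula (not code from the paper); plugging in the ENOB of 5.6 reported at the 128 MHz cut-off frequency reproduces the quoted ~83 fJ/conv.

```python
# Illustrative check of the Walden figure of merit, FoM = P / (2**ENOB * f_s).
# The numbers come from the abstract; the formula is the standard textbook
# definition and is not specific to this paper.

def walden_fom(power_w: float, enob: float, f_sample_hz: float) -> float:
    """Return the Walden FoM in joules per conversion step."""
    return power_w / (2 ** enob * f_sample_hz)

power = 0.515e-3   # 0.515 mW
f_s = 128e6        # 128 MHz cut-off frequency
enob = 5.6         # ENOB reported at the cut-off frequency

fom = walden_fom(power, enob, f_s)
print(f"FoM = {fom * 1e15:.1f} fJ/conv")   # ~83 fJ/conv, matching the reported 82.95
```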
{"title":"A subranging nonuniform sampling memristive neural network-based analog-to-digital converter","authors":"Hao You , Amirali Amirsoleimani , Jianxiong Xu , Mostafa Rahimi Azghadi , Roman Genov","doi":"10.1016/j.memori.2023.100038","DOIUrl":"https://doi.org/10.1016/j.memori.2023.100038","url":null,"abstract":"<div><p>This work presents a novel 4-bit subranging nonuniform sampling (NUS) memristive neural network-based analog-to-digital converter (ADC) with improved performance trade-off among speed, power, area, and accuracy. The proposed design preserves the memristive neural network calibration and utilizes a trainable memristor weight to adapt to device mismatch and increase accuracy. Rather than conventional binary searching, we adopt quaternary searching in the ADC to realize subranging architecture’s coarse and fine bits determination. A level-crossing nonuniform sampling (NUS) is introduced to the proposed ADC to enhance the ENOB under the same resolutions, power, and area consumption. Area and power consumption are reduced through circuit sharing between different stages of bit determination. The proposed 4-bit ADC achieves a highest ENOB of 5.96 and 5.6 at cut-off frequency (128 <span><math><mi>MHz</mi></math></span>) with power consumption of 0.515 <span><math><mi>mW</mi></math></span> and a figure of merit (FoM) of 82.95 <span><math><mi>fJ/conv</mi></math></span>.</p></div>","PeriodicalId":100915,"journal":{"name":"Memories - Materials, Devices, Circuits and Systems","volume":"4 ","pages":"Article 100038"},"PeriodicalIF":0.0,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"50199593","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-07-01 | DOI: 10.1016/j.memori.2023.100043
Flux-charge analysis and experimental verification of a parallel Memristor–Capacitor circuit
M.A. Carrasco-Aguilar, F.E. Morales-López, C. Sánchez-López, Rocio Ochoa-Montiel
In this article, the flux-charge analysis method is applied to obtain the theoretical response of the voltage generated in a parallel Memristor–Capacitor (M–C) circuit excited by an input pulse generator with a 100 kHz frequency, 5 V amplitude, and a 50 Ω output impedance. The theoretical solution of the nonlinear ordinary differential equation resulting from the method is obtained numerically. A previously reported floating memristor emulator is used as the memristive circuit. The obtained response is compared with the experimental response, providing evidence that the applied analysis method yields an acceptable margin of error with respect to the experimental results; this contrasts with other similar reports, where the analyses are based on theoretical memristive models and show simulation results only. In summary, the paper contributes to the analysis and experimental verification of a parallel M–C circuit subjected to a real switched excitation source, using a memristance equation established from an emulator that differs from the equations commonly used in the literature.
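As a rough illustration of the kind of numerical solution such an analysis is checked against, the sketch below integrates a parallel memristor-capacitor branch driven through a 50 Ω source by a 5 V, 100 kHz square pulse. The memristance model used here is a simple linear dopant-drift placeholder with made-up parameter values; the paper instead uses the memristance equation of a specific floating memristor emulator, which is not reproduced here.

```python
# Minimal sketch (not the paper's emulator model): a parallel memristor-capacitor
# branch driven by a 5 V, 100 kHz square pulse through a 50-ohm source impedance.
# The memristor follows a linear dopant-drift model with illustrative parameters.
import numpy as np

R_S, C = 50.0, 10e-9          # source impedance [ohm], capacitance [F] (assumed)
R_ON, R_OFF = 1e3, 20e3       # memristor limit resistances [ohm] (assumed)
K = 1e8                       # state-drift coefficient (assumed, arbitrary units)
F, V_AMP = 100e3, 5.0         # pulse frequency [Hz] and amplitude [V]

def v_source(t):
    """5 V, 100 kHz, 50% duty-cycle pulse train."""
    return V_AMP if (t * F) % 1.0 < 0.5 else 0.0

def memristance(x):
    """Linear mix between R_ON and R_OFF as the state x goes from 1 to 0."""
    return R_ON * x + R_OFF * (1.0 - x)

# Forward-Euler integration of the two state equations:
#   C*dv/dt = (v_s - v)/R_S - v/M(x)   (node voltage across the M||C branch)
#   dx/dt   = K * v / M(x), with x clamped to [0, 1]
dt, t_end = 1e-9, 5 / F
v, x, v_peak = 0.0, 0.1, 0.0
for t in np.arange(0.0, t_end, dt):
    m = memristance(x)
    dv = ((v_source(t) - v) / R_S - v / m) / C
    v += dv * dt
    x = min(max(x + (K * v / m) * dt, 0.0), 1.0)
    v_peak = max(v_peak, v)

print(f"peak node voltage over 5 periods: {v_peak:.3f} V")
```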
{"title":"Flux-charge analysis and experimental verification of a parallel Memristor–Capacitor circuit","authors":"M.A. Carrasco-Aguilar, F.E. Morales-López, C. Sánchez-López, Rocio Ochoa-Montiel","doi":"10.1016/j.memori.2023.100043","DOIUrl":"https://doi.org/10.1016/j.memori.2023.100043","url":null,"abstract":"<div><p>In this article, the flux-charge analysis method is applied to obtain the theoretical response of the voltage generated in a parallel Memristor–Capacitor (M–C) circuit excited by an input pulse generator with a 100 kHz frequency, 5 V amplitude and a 50 ohms output impedance. The theoretical solution of the nonlinear ordinary differential equation that results when applying the method is reached by a numerical method. As a memristive circuit, a previously reported floating memristor emulator was used. The response obtained is compared with the experimental response, generating evidence that the applied analysis method yields an acceptable margin of error with regards to the experimental results obtained, contrasting with other similar reports, where the analyzes are based on theoretical memristive models, and show simulation results only. Summary, the paper would contribute to the analysis and experimental verification of the parallel M–C circuit subjected to a real switched exciting source, using a memristance equation established in an emulator that is different from the equations commonly used in the literature.</p></div>","PeriodicalId":100915,"journal":{"name":"Memories - Materials, Devices, Circuits and Systems","volume":"4 ","pages":"Article 100043"},"PeriodicalIF":0.0,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"50199628","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-07-01 | DOI: 10.1016/j.memori.2023.100051
A review on computational storage devices and near memory computing for high performance applications
Dina Fakhry, Mohamed Abdelsalam, M. Watheq El-Kharashi, Mona Safar
The von Neumann bottleneck is imposed by the explosion of data transfers and the emergence of data-intensive applications in heterogeneous system architectures. The conventional approach of transferring data to the CPU for computation is no longer suitable, especially given the cost it imposes. Given increasing storage capacities, moving extensive data volumes between storage and computation cannot scale. Hence, high-performance data processing mechanisms are needed, which may be achieved by bringing computation closer to the data. Gathering insights where the data is stored helps with energy efficiency, low latency, and security. Storage bus bandwidth is also saved when only computation results are delivered to the host memory. Various applications, including database acceleration, machine learning, Artificial Intelligence (AI), offloading (compression/encryption/encoding), and others can perform better and become more scalable if the "move process to data" paradigm is applied. Embedding processing engines inside Solid-State Drives (SSDs), transforming them into Computational Storage Devices (CSDs), provides the needed data processing solution. In this paper, we review the prior art on Near Data Processing (NDP) with a focus on In-Storage Computing (ISC), identifying the main challenges and potential gaps for future research directions.
{"title":"A review on computational storage devices and near memory computing for high performance applications","authors":"Dina Fakhry , Mohamed Abdelsalam , M. Watheq El-Kharashi , Mona Safar","doi":"10.1016/j.memori.2023.100051","DOIUrl":"https://doi.org/10.1016/j.memori.2023.100051","url":null,"abstract":"<div><p>The von Neumann bottleneck is imposed due to the explosion of data transfers and emerging data-intensive applications in heterogeneous system architectures. The conventional computation approach of transferring data to CPU is no longer suitable especially with the cost it imposes. Given the increasing storage capacities, moving extensive data volumes between storage and computation cannot scale up. Hence, high-performance data processing mechanisms are needed, which may be achieved by bringing computation closer to data. Gathering insights where data is stored helps deal with energy efficiency, low latency, as well as security. Storage bus bandwidth is also saved when only computation results are delivered to the host memory. Various applications, including database acceleration, machine learning, Artificial Intelligence (AI), offloading (compression/encryption/encoding) and others can perform better and become more scalable if the “move process to data” paradigm is applied. Embedding processing engines inside Solid-State Drives (SSDs), transforming them to Computational Storage Devices (CSDs), provides the needed data processing solution. In this paper, we review the prior art on Near Data Processing (NDP) with focus on In-Storage Computing (ISC), identifying main challenges and potential gaps for future research directions.</p></div>","PeriodicalId":100915,"journal":{"name":"Memories - Materials, Devices, Circuits and Systems","volume":"4 ","pages":"Article 100051"},"PeriodicalIF":0.0,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"50200166","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-07-01 | DOI: 10.1016/j.memori.2023.100049
Realization of multi-mode universal shadow filter and its application as a frequency-hopping filter
Divya Singh, Sajal K. Paul
This work presents a new active block, the differential current conveyor cascaded transconductance amplifier (DCCCTA), and uses it to implement a multi-mode biquadratic universal shadow filter. A frequency-hopping filter is then realized using the multi-mode universal shadow filter. The proposed circuit has two modes of operation: current mode (CM) and transadmittance mode (TAM). All-pass (AP), band-pass (BP), band-reject (BR), high-pass (HP), and low-pass (LP) responses are accomplished simultaneously. As intended, low input impedance for CM and high input impedance for TAM are obtained, while high output impedance is attained for both modes of operation. Inter-modulation distortion (IMD), percentage total harmonic distortion (%THD), and Monte Carlo analyses are also presented. The theoretical results are verified using Cadence Virtuoso in 180 nm TSMC technology.
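For readers unfamiliar with how a single biquadratic section yields all five responses, the sketch below evaluates the standard second-order transfer functions (LP, HP, BP, BR, AP) from one shared pole frequency and quality factor. This is textbook biquad math for illustration only; it does not model the DCCCTA circuit or the shadow-filter feedback described in the paper, and the f0 and Q values are assumptions.

```python
# Generic biquad responses sharing one pole pair (w0, Q) -- textbook formulas,
# used only to illustrate how LP/HP/BP/BR/AP arise from one common denominator.
# This is not a model of the DCCCTA-based shadow filter itself.
import numpy as np

def biquad_responses(f, f0=1e6, q=1.0):
    """Return magnitude responses |H(j*2*pi*f)| for LP, HP, BP, BR and AP."""
    s = 1j * 2 * np.pi * f
    w0 = 2 * np.pi * f0
    d = s**2 + (w0 / q) * s + w0**2          # common denominator
    return {
        "LP": np.abs(w0**2 / d),
        "HP": np.abs(s**2 / d),
        "BP": np.abs((w0 / q) * s / d),
        "BR": np.abs((s**2 + w0**2) / d),
        "AP": np.abs((s**2 - (w0 / q) * s + w0**2) / d),
    }

f = np.logspace(4, 8, 5)                      # 10 kHz .. 100 MHz
for name, mag in biquad_responses(f).items():
    print(name, np.round(20 * np.log10(mag), 1), "dB")
```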
{"title":"Realization of multi-mode universal shadow filter and its application as a frequency-hopping filter","authors":"Divya Singh, Sajal K. Paul","doi":"10.1016/j.memori.2023.100049","DOIUrl":"https://doi.org/10.1016/j.memori.2023.100049","url":null,"abstract":"<div><p>This work presents a new active block called the differential current conveyor cascaded transconductance amplifier (DCCCTA) and implemented multi-mode biquadratic universal shadow filter. The frequency-hopping filter is implemented using a multi-mode universal shadow filter. The proposed circuit has two modes of operation: current mode (CM) and transadmittance mode (TAM). All-pass (AP), band-pass (BP), band-reject (BR), high-pass (HP), and low-pass (LP) responses are simultaneously accomplished. As intended, low input impedance for CM and high input impedance for TAM are acquired, while high output impedance is attained for both modes of operation. Inter-modulation distortion (IMD), percentage total harmonic distortion (%THD), and Monte Carlo analysis are also obtained. The theoretical results are verified using Cadence Virtuoso in 180 nm TSMC technology.</p></div>","PeriodicalId":100915,"journal":{"name":"Memories - Materials, Devices, Circuits and Systems","volume":"4 ","pages":"Article 100049"},"PeriodicalIF":0.0,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"50199566","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-07-01 | DOI: 10.1016/j.memori.2023.100039
Reconfigurable optoelectronic absorber based on nested nano disk-ribbon graphene pattern in THz range
Ilghar Rezaei, Ava Salmanpour, Toktam Aghaee
A two-layer, multi-band, tunable super absorber is proposed in this paper. The idea behind the design is to realize periodic arrays of graphene disks via graphene ribbons of different lengths. A circuit model is then developed and used together with the impedance-matching concept to achieve more than ten absorption peaks. The spacer is a lossless polymer in the THz frequency range, while the bottom of the device is occupied by a relatively thick gold plate. The developed circuit-model description is verified by full-wave simulation. According to the simulation results, the proposed absorber shows more than ten peaks with absorption over 90%. Interestingly, the peak frequencies can be shifted by varying a single chemical potential. Additionally, the deviations of the absorber response with respect to the graphene electron relaxation time and the device geometry are shown to be marginal, which makes the presented meta-absorber a reliable optical device.
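The impedance-matching argument behind such circuit models reduces to a standard reflection calculation: with a thick metal back plate blocking transmission, absorption is A = 1 - |Γ|², where Γ = (Z_in - Z0)/(Z_in + Z0). The sketch below evaluates that relation for an illustrative series-RLC surface impedance with made-up element values; it is not the circuit model extracted in the paper for the graphene disk-ribbon pattern.

```python
# Absorption from impedance matching: with an opaque metal back plate,
#   A(w) = 1 - |Gamma(w)|^2,  Gamma = (Z_in - Z0) / (Z_in + Z0).
# The series-RLC surface impedance below uses placeholder values (assumed),
# not the circuit model extracted for the graphene disk/ribbon pattern.
import numpy as np

Z0 = 377.0                                  # free-space impedance [ohm]
f = np.linspace(0.5e12, 5e12, 10)           # THz sweep
w = 2 * np.pi * f

# Illustrative resonator values: R ~ Z0 gives near-perfect matching at resonance
# (~2 THz for these L and C).
R, L, C = 377.0, 1e-12, 6.3e-15             # ohm, henry, farad (assumed)
Z_in = R + 1j * w * L + 1 / (1j * w * C)    # series-RLC branch seen at the surface

gamma = (Z_in - Z0) / (Z_in + Z0)
absorption = 1 - np.abs(gamma) ** 2
for fi, a in zip(f, absorption):
    print(f"{fi/1e12:4.1f} THz : A = {a:.2f}")   # approaches 1.0 near resonance
```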
{"title":"Reconfigurable optoelectronic absorber based on nested nano disk-ribbon graphene Pattern in THz range","authors":"Ilghar Rezaei , Ava Salmanpour , Toktam Aghaee","doi":"10.1016/j.memori.2023.100039","DOIUrl":"https://doi.org/10.1016/j.memori.2023.100039","url":null,"abstract":"<div><p>A two-layers, multi-band super absorber with the capability of being tuned is proposed in this paper. The idea behind the design is to realize periodic arrays of graphene disks via graphene ribbons with different lengths. Then circuit modeling is developed to be used alongside the impedance matching concept to achieve more than ten absorption peaks. The exploited spacer is a lossless polymer in the THz frequency range while the bottom of the device is occupied by a relatively thick golden plate. The developed circuit model description is verified by full-wave simulation. According to the simulation results, the proposed absorber shows more than ten peaks with absorption over 90%. The peak frequencies are interestingly able to be shifted via exploited single chemical potential variations. Additionally, deviations of absorber response against graphene electron relaxation time and device geometry are shown to be marginal which makes the presented meta-absorber, a reliable optical device.</p></div>","PeriodicalId":100915,"journal":{"name":"Memories - Materials, Devices, Circuits and Systems","volume":"4 ","pages":"Article 100039"},"PeriodicalIF":0.0,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"50199594","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-07-01 | DOI: 10.1016/j.memori.2023.100031
Improvement of memory performance of 3-D NAND flash memory with retrograde channel doping
Deepika Gupta, Abhishek Kumar Upadhyay, Ankur Beohar, Santosh Kumar Vishvakarma
This paper examines the effect of retrograde channel doping on the reliability and performance of 3-D junction-free NAND-based flash memory. Specifically, we study the program characteristics and data retention capability of junction-free NAND flash memory with half pitches ranging from 35 nm to 12 nm. Based on our analysis, we highlight that the retrograde channel doping approach can not only suppress short-channel effects (SCEs) but also improve the program speed and data control time of 3-D junction-free NAND flash memory, without varying the oxide stack of the charge-trap-based flash memory.
{"title":"Improvement of memory performance of 3-D NAND flash memory with retrograde channel doping","authors":"Deepika Gupta , Abhishek Kumar Upadhyay , Ankur Beohar , Santosh Kumar Vishvakarma","doi":"10.1016/j.memori.2023.100031","DOIUrl":"https://doi.org/10.1016/j.memori.2023.100031","url":null,"abstract":"<div><p>The examination of the effect of retrograde channel doping on reliability and performance of 3-D junction-free NAND based flash memory is done for this paper. Specifically, we study the program characteristics, data retention capability junction-free NAND flash memory with half pitch range from 35 nm to 12 nm. Based on our analysis, we highlight that the retrograde channel doping approach can improve not only the SCEs but also the program speed and data control time for 3-D junction-free NAND flash memory, without varying the oxide stack in charge trap-based flash memory.</p></div>","PeriodicalId":100915,"journal":{"name":"Memories - Materials, Devices, Circuits and Systems","volume":"4 ","pages":"Article 100031"},"PeriodicalIF":0.0,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"50199597","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-07-01 | DOI: 10.1016/j.memori.2023.100034
FourierPIM: High-throughput in-memory Fast Fourier Transform and polynomial multiplication
Orian Leitersdorf, Yahav Boneh, Gonen Gazit, Ronny Ronen, Shahar Kvatinsky
The Discrete Fourier Transform (DFT) is essential for various applications ranging from signal processing to convolution and polynomial multiplication. The groundbreaking Fast Fourier Transform (FFT) algorithm reduces the DFT time complexity from the naive O(n^2) to O(n log n), and recent works have sought further acceleration through parallel architectures such as GPUs. Unfortunately, accelerators such as GPUs cannot exploit their full computing capabilities since memory access becomes the bottleneck. Therefore, this paper accelerates the FFT algorithm using digital Processing-in-Memory (PIM) architectures that shift computation into the memory by exploiting physical devices capable of both storage and logic (e.g., memristors). We propose an O(log n) in-memory FFT algorithm that can also be performed in parallel across multiple arrays for high-throughput batched execution, supporting both fixed-point and floating-point numbers. Through the convolution theorem, we extend this algorithm to O(log n) polynomial multiplication, a fundamental task for applications such as cryptography. We evaluate FourierPIM on a publicly-available cycle-accurate simulator that verifies both correctness and performance, and demonstrate 5–15× throughput and 4–13× energy improvement over the NVIDIA cuFFT library on state-of-the-art GPUs for FFT and polynomial multiplication.
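The convolution-theorem route from the FFT to polynomial multiplication referred to in the abstract is a standard construction: transform both coefficient vectors, multiply their spectra pointwise, and transform back. The sketch below shows that construction with NumPy on a CPU purely as a reference for the math; it says nothing about the in-memory (memristor) execution that FourierPIM actually contributes.

```python
# Polynomial multiplication via the convolution theorem (reference math only;
# this NumPy version runs on a CPU and does not model FourierPIM's in-memory FFT).
import numpy as np

def poly_mul_fft(a, b):
    """Multiply polynomials given as coefficient lists (lowest degree first)."""
    n = len(a) + len(b) - 1                 # number of coefficients in the product
    size = 1 << (n - 1).bit_length()        # next power of two for the FFT
    fa = np.fft.rfft(a, size)               # spectra of the zero-padded inputs
    fb = np.fft.rfft(b, size)
    prod = np.fft.irfft(fa * fb, size)[:n]  # pointwise product, then inverse FFT
    return np.rint(prod).astype(int)        # round back to integer coefficients

# (1 + 2x + 3x^2) * (4 + 5x) = 4 + 13x + 22x^2 + 15x^3
print(poly_mul_fft([1, 2, 3], [4, 5]))      # -> [ 4 13 22 15]
```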
{"title":"FourierPIM: High-throughput in-memory Fast Fourier Transform and polynomial multiplication","authors":"Orian Leitersdorf, Yahav Boneh, Gonen Gazit, Ronny Ronen, Shahar Kvatinsky","doi":"10.1016/j.memori.2023.100034","DOIUrl":"https://doi.org/10.1016/j.memori.2023.100034","url":null,"abstract":"<div><p>The Discrete Fourier Transform (DFT) is essential for various applications ranging from signal processing to convolution and polynomial multiplication. The groundbreaking Fast Fourier Transform (FFT) algorithm reduces DFT time complexity from the naive <span><math><mrow><mi>O</mi><mrow><mo>(</mo><msup><mrow><mi>n</mi></mrow><mrow><mn>2</mn></mrow></msup><mo>)</mo></mrow></mrow></math></span> to <span><math><mrow><mi>O</mi><mrow><mo>(</mo><mi>n</mi><mo>log</mo><mi>n</mi><mo>)</mo></mrow></mrow></math></span>, and recent works have sought further acceleration through parallel architectures such as GPUs. Unfortunately, accelerators such as GPUs cannot exploit their full computing capabilities since memory access becomes the bottleneck. Therefore, this paper accelerates the FFT algorithm using digital Processing-in-Memory (PIM) architectures that shift computation into the memory by exploiting physical devices capable of both storage and logic (e.g., memristors). We propose an <span><math><mrow><mi>O</mi><mrow><mo>(</mo><mo>log</mo><mi>n</mi><mo>)</mo></mrow></mrow></math></span> in-memory FFT algorithm that can also be performed in parallel across multiple arrays for <em>high-throughput batched execution</em>, supporting both fixed-point and floating-point numbers. Through the convolution theorem, we extend this algorithm to <span><math><mrow><mi>O</mi><mrow><mo>(</mo><mo>log</mo><mi>n</mi><mo>)</mo></mrow></mrow></math></span> polynomial multiplication – a fundamental task for applications such as cryptography. We evaluate FourierPIM on a publicly-available cycle-accurate simulator that verifies both correctness and performance, and demonstrate <span><math><mrow><mtext>5–15</mtext><mo>×</mo></mrow></math></span> throughput and <span><math><mrow><mtext>4–13</mtext><mo>×</mo></mrow></math></span> energy improvement over the NVIDIA cuFFT library on state-of-the-art GPUs for FFT and polynomial multiplication.</p></div>","PeriodicalId":100915,"journal":{"name":"Memories - Materials, Devices, Circuits and Systems","volume":"4 ","pages":"Article 100034"},"PeriodicalIF":0.0,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"50199598","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-07-01 | DOI: 10.1016/j.memori.2022.100022
A survey on processing-in-memory techniques: Advances and challenges
Kazi Asifuzzaman, Narasinga Rao Miniskar, Aaron R. Young, Frank Liu, Jeffrey S. Vetter
Processing-in-memory (PIM) techniques have gained much attention from computer architecture researchers, and significant research effort has been invested in exploring and developing such techniques. Increasing the research activity dedicated to improving PIM techniques will hopefully help deliver PIM’s promise to solve or significantly reduce memory access bottleneck problems for memory-intensive applications. We also believe it is imperative to track the advances made in PIM research to identify open challenges and enable the research community to make informed decisions and adjust future research directions. In this survey, we analyze recent studies that explored PIM techniques, summarize the advances made, compare recent PIM architectures, and identify target application domains and suitable memory technologies. We also discuss proposals that address unresolved issues of PIM designs (e.g., address translation/mapping of operands, workload analysis to identify application segments that can be accelerated with PIM, OS/runtime support, and coherency issues that must be resolved to incorporate PIM). We believe this work can serve as a useful reference for researchers exploring PIM techniques.
{"title":"A survey on processing-in-memory techniques: Advances and challenges","authors":"Kazi Asifuzzaman, Narasinga Rao Miniskar, Aaron R. Young, Frank Liu, Jeffrey S. Vetter","doi":"10.1016/j.memori.2022.100022","DOIUrl":"https://doi.org/10.1016/j.memori.2022.100022","url":null,"abstract":"<div><p>Processing-in-memory (PIM) techniques have gained much attention from computer architecture researchers, and significant research effort has been invested in exploring and developing such techniques. Increasing the research activity dedicated to improving PIM techniques will hopefully help deliver PIM’s promise to solve or significantly reduce memory access bottleneck problems for memory-intensive applications. We also believe it is imperative to track the advances made in PIM research to identify open challenges and enable the research community to make informed decisions and adjust future research directions. In this survey, we analyze recent studies that explored PIM techniques, summarize the advances made, compare recent PIM architectures, and identify target application domains and suitable memory technologies. We also discuss proposals that address unresolved issues of PIM designs (e.g., address translation/mapping of operands, workload analysis to identify application segments that can be accelerated with PIM, OS/runtime support, and coherency issues that must be resolved to incorporate PIM). We believe this work can serve as a useful reference for researchers exploring PIM techniques.</p></div>","PeriodicalId":100915,"journal":{"name":"Memories - Materials, Devices, Circuits and Systems","volume":"4 ","pages":"Article 100022"},"PeriodicalIF":0.0,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"50200136","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-07-01 | DOI: 10.1016/j.memori.2023.100058
Voltage Reduced Self Refresh (VRSR) for optimized energy savings in DRAM Memories
Diyanesh Chinnakkonda, Venkata Kalyan Tavva, M.B. Srinivas
Modern computing systems demand DRAMs with more capacity and bandwidth to keep pace with the onslaught of new data-intensive applications. Though DRAM scaling offers higher-density devices to realize high-memory-capacity systems, energy consumption has become a key design limiter, owing to the fact that the memory sub-system continues to be responsible for a significant fraction of overall system energy. Self-refresh mode is the low-power state that consumes the least DRAM energy, and it is an essential operation to avoid data loss. However, self-refresh energy also continues to grow with density scaling. This paper carries out a detailed study of reducing self-refresh energy by reducing the supply voltage. PARSEC benchmarks in Gem5 full-system mode are used to quantify the merit of self-refresh energy savings at reduced voltages for the normal, reduced, and extended temperature ranges. The latency impacts of the basic operations involved in self-refresh are evaluated using a 16 nm SPICE model. Possible limitations in extending the work to real hardware are also discussed. As a potential opportunity to motivate future implementation, DRAM architectural changes, additional low-power states, and an entry/exit flow to exercise reduced-voltage operation in self-refresh mode are proposed. We present this new low-power mode as the Voltage Reduced Self-Refresh (VRSR) operation. Our simulation results show a maximum of ~12.4% and an average of ~4% workload energy savings, with less than 0.7% performance loss across all benchmarks, for an aggressive voltage reduction of 150 mV.
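To put workload-level numbers like these in perspective, the back-of-the-envelope sketch below shows how a percentage cut in self-refresh energy translates into total workload energy savings, depending on what share of workload energy the self-refresh state accounts for. Both fractions used here are illustrative assumptions, not figures from the paper.

```python
# Back-of-the-envelope: if self-refresh accounts for a fraction f of a workload's
# total energy and a reduced-voltage mode cuts self-refresh energy by a fraction s,
# the workload-level saving is simply f * s.  The f and s values below are
# illustrative assumptions, not measurements from the paper.

def workload_saving(self_refresh_share: float, self_refresh_cut: float) -> float:
    """Fraction of total workload energy saved."""
    return self_refresh_share * self_refresh_cut

for f in (0.10, 0.25, 0.40):      # assumed share of workload energy spent in self-refresh
    for s in (0.20, 0.30):        # assumed relative self-refresh energy reduction
        print(f"share={f:.0%}, cut={s:.0%} -> total saving={workload_saving(f, s):.1%}")
# Savings in the low single-digit percent range appear for modest self-refresh shares,
# consistent in spirit with the ~4% average reported in the abstract.
```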
{"title":"Voltage Reduced Self Refresh (VRSR) for optimized energy savings in DRAM Memories","authors":"Diyanesh Chinnakkonda , Venkata Kalyan Tavva , M.B. Srinivas","doi":"10.1016/j.memori.2023.100058","DOIUrl":"https://doi.org/10.1016/j.memori.2023.100058","url":null,"abstract":"<div><p>Modern computing systems demand DRAMs with more capacity and bandwidth to keep pace with the onslaught of new data-intensive applications. Though DRAM scaling offers higher density devices to realize high memory capacity systems, energy consumption has become a key design limiter. This is owing to the fact that the memory sub-system continues to be responsible for a significant fraction of overall system energy. Self-refresh mode is one low power state that consumes the least DRAM energy, and this is an essential operation to avoid data loss. However, self-refresh energy also continues to grow with density scaling. This paper carries out a detailed study of reducing self-refresh energy by reducing the supply voltage. PARSEC benchmarks in Gem5 full-system mode are used to quantify the merit of self-refresh energy savings at reduced voltages for normal, reduced, and extended temperature ranges. The latency impacts of basic operations involved in self-refresh operation are evaluated using the 16 nm SPICE model. Possible limitations in extending the work to real hardware are also discussed. As a potential opportunity to motivate for future implementation, DRAM architectural changes, additional low power states and entry/exit flow to exercise reduced voltage operation in self-refresh mode are proposed. We present this new low power mode as Voltage Reduced Self-Refresh (VRSR) operation. Our simulation results show that there is a maximum of <span><math><mo>∼</mo></math></span>12.4% and an average of <span><math><mo>∼</mo></math></span>4% workload energy savings, with less than 0.7% performance loss across all benchmarks, for an aggressive voltage reduction of 150 mV.</p></div>","PeriodicalId":100915,"journal":{"name":"Memories - Materials, Devices, Circuits and Systems","volume":"4 ","pages":"Article 100058"},"PeriodicalIF":0.0,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"50199558","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-07-01 | DOI: 10.1016/j.memori.2023.100045
Efficient grouping approach for fault tolerant weight mapping in memristive crossbar array
Dev Narayan Yadav, Phrangboklang Lyngton Thangkhiew, Sandip Chakraborty, Indranil Sengupta
The ability of resistive memory (ReRAM) to naturally perform vector–matrix multiplication (VMM), the primary operation carried out during the training and inference of neural networks, has caught the interest of researchers. The memristor crossbar is one of the desirable architectures for performing VMM because it offers various benefits over other memory technologies, including in-memory computing, low power, and high density. Direct downloading and chip-in-the-loop approaches are typically used to train ReRAM-based neural networks. In these methods, all weight computations are carried out by a host machine, and the computed weights are downloaded into the crossbar. It has been observed that, once the weights have been downloaded, the network does not deliver the precision promised by the host system. This is because crossbars contain a significant number of faulty memristors and suffer from cell resistance variations due to immature manufacturing technologies. As a result, a cell may not be able to take the exact weight value that the host system generates, which may lead to incorrect inferences. Existing techniques for fault-tolerant mapping either involve network retraining or employ a graph-matching strategy that incurs hardware, power, and latency overheads. In this paper, we propose a mapping method that tolerates the effect of defective memristors. To lessen the impact of faulty memristors, the mapping is done in a way that allows the network weights to mask the faulty memristors. Further, this work prioritizes the different faults based on their frequency of occurrence. The mapping efficiency is found to increase significantly, with low power, area, and latency overheads, in the proposed approach. Experimental analyses show considerable improvement compared to state-of-the-art works.
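The crossbar VMM itself, and the idea of steering weights away from faulty cells, can be illustrated with a small toy example: the sketch below computes y = G^T v for a conductance matrix with two stuck-at-zero cells, then greedily picks the column permutation that minimizes the output error those faults cause. This greedy permutation search is only a simplified stand-in for the grouping-based mapping the paper proposes, and all matrix values and fault positions are made up.

```python
# Toy illustration: crossbar VMM (y = G^T v) plus a naive fault-aware mapping that
# permutes weight columns to minimize the error caused by stuck-at-zero cells.
# This greedy search is a simplified stand-in for the paper's grouping approach;
# all weights, inputs, and fault positions below are made up.
import itertools
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))                          # trained weights (rows: inputs)
stuck_at_zero = np.zeros_like(W, dtype=bool)
stuck_at_zero[1, 2] = stuck_at_zero[3, 0] = True     # two faulty cells (assumed)
v = rng.normal(size=4)                               # input vector (e.g., voltages)

def crossbar_vmm(weights, faults, vin):
    """Ideal crossbar output with stuck-at-zero cells forced to zero conductance."""
    g = np.where(faults, 0.0, weights)
    return g.T @ vin

ideal = W.T @ v
best_perm, best_err = None, np.inf
for perm in itertools.permutations(range(W.shape[1])):
    out = crossbar_vmm(W[:, perm], stuck_at_zero, v)
    err = np.linalg.norm(out - ideal[list(perm)])    # fault-free output, same column order
    if err < best_err:
        best_perm, best_err = perm, err

naive_err = np.linalg.norm(crossbar_vmm(W, stuck_at_zero, v) - ideal)
print(f"naive mapping error          : {naive_err:.3f}")
print(f"best permutation {best_perm} error: {best_err:.3f}")
```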
{"title":"Efficient grouping approach for fault tolerant weight mapping in memristive crossbar array","authors":"Dev Narayan Yadav , Phrangboklang Lyngton Thangkhiew , Sandip Chakraborty , Indranil Sengupta","doi":"10.1016/j.memori.2023.100045","DOIUrl":"https://doi.org/10.1016/j.memori.2023.100045","url":null,"abstract":"<div><p>The ability of resistive memory (ReRAM) to naturally conduct vector–matrix multiplication (VMM), which is the primary operation carried out during the training and inference of neural networks, has caught the interest of researchers. The memristor crossbar is one of the desirable architectures to perform VMM because it offers various benefits over other memory technologies, including in-memory computing, low power, and high density. Direct downloading and chip-on-the-loop approaches are typically used to train ReRAM-based neural networks. In these methods, all weight computations are carried out by a host machine, and the computed weights are downloaded in the crossbar. It has been seen that the network does not deliver the same precision as promised by the host system once the weights have been downloaded. This is because crossbars contain a significant number of faulty memristors and suffer from cell resistance variations because of immature manufacturing technologies. As a result, a cell may not be able to take the exact weight values that the host system generates, and may lead to incorrect inferences. Existing techniques for fault-tolerant mapping either involve network retraining or employ a graph-matching strategy that comes with hardware, power, and latency overheads. In this paper, we propose a mapping method to tolerate the effect of defective memristors. In order to lessen the impact of faulty memristors, the mapping is done in a way that allows network weights to cover up faulty memristors. Further, this work prioritizes the different faults based on the frequency of occurrence. The mapping efficiency is found to increase significantly with low power, area and latency overheads in the proposed approach. Experimental analyses show considerable improvement as compared to state-of-the-art works.</p></div>","PeriodicalId":100915,"journal":{"name":"Memories - Materials, Devices, Circuits and Systems","volume":"4 ","pages":"Article 100045"},"PeriodicalIF":0.0,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"50199591","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}