Pub Date: 2025-06-23 | DOI: 10.1109/TVLSI.2025.3578619
Kun Li;Xiangyu Hao;Zhenguo Ma;Feng Yu;Bo Zhang;Qianjian Xing
This brief presents a pipelined floating-point multiply–accumulator (FPMAC) architecture designed to accelerate sparse linear algebra operations. By designing a lookup-table-based 5–3 carry-save adder (CSA) and combining it with a 3–2 CSA, the proposed design shortens the critical path and boosts operational speed. Moreover, the architecture exploits the data characteristics of sparse linear algebra to move the shift unit out of the critical accumulation loop, further increasing the throughput rate. In addition, the integration of a lookup-table-based leading-zero anticipator (LZA) improves normalization efficiency. Experimental results show that, compared with reported FPMAC designs, the proposed architecture achieves a significantly higher maximum clock frequency for single-precision floating-point operations.
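As a rough illustration of the carry-save idea behind the 5–3/3–2 CSA accumulation loop, the Python sketch below (a functional model under my own assumptions, not the paper's RTL) keeps a running total in redundant sum/carry form so that each accumulation step avoids a full carry-propagate addition; only the final result needs one.

```python
# Minimal sketch of carry-save accumulation (assumed behavior, not the paper's design).

def csa_3_2(a: int, b: int, c: int) -> tuple[int, int]:
    """3:2 carry-save adder: compress three operands into a sum/carry pair."""
    s = a ^ b ^ c                                # bitwise sum, no carry propagation
    c_out = ((a & b) | (b & c) | (a & c)) << 1   # majority function gives the carries
    return s, c_out

def csa_accumulate(values):
    """Accumulate a stream while keeping the total in (sum, carry) redundant form."""
    acc_s, acc_c = 0, 0
    for v in values:
        acc_s, acc_c = csa_3_2(acc_s, acc_c, v)
    return acc_s + acc_c                         # single final carry-propagate add

assert csa_accumulate([3, 5, 7, 11]) == 26
```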
{"title":"A Fast Floating-Point Multiply–Accumulator Optimized for Sparse Linear Algebra on FPGAs","authors":"Kun Li;Xiangyu Hao;Zhenguo Ma;Feng Yu;Bo Zhang;Qianjian Xing","doi":"10.1109/TVLSI.2025.3578619","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3578619","url":null,"abstract":"This brief presents a pipelined floating-point Multiply–Accumulator (FPMAC) architecture designed to accelerate sparse linear algebra operations. By designing a lookup-table-based 5–3 carry-save adder (CSA) and combining it with a 3–2 CSA, the proposed design minimizes the critical path and boosts operational speed. Moreover, the proposed architecture takes advantage of data characteristics in sparse linear algebra to displace the shift unit in the critical accumulation loop, further increasing the throughput rate. In addition, the integration of a lookup-table-based leading-zero anticipator (LZA) enhances normalization efficiency. Experimental results show that, compared with reported FPMAC designs, the proposed architecture may achieve a significantly higher maximum clock frequency for single-precision floating-point operations.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 9","pages":"2592-2596"},"PeriodicalIF":3.1,"publicationDate":"2025-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144904906","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-06-23 | DOI: 10.1109/TVLSI.2025.3574427
Jingzhou Li;Fangfei Yu;Mingyuan Ma;Wei Liu;Yuhan Wang;Hualin Wu;Hu He
General-purpose graphics processing units (GPGPUs) have become a leading platform for accelerating modern compute-intensive applications, such as large language models and generative artificial intelligence (AI). However, the lack of advanced open-source GPGPU microarchitectures has hindered high-performance research in this area. In this article, we present Ventus, a high-performance open-source GPGPU implementation built upon the RISC-V architecture with the RISC-V vector (RVV) extension. Ventus introduces customized instructions and a comprehensive software toolchain to optimize performance. We deployed the design on a field-programmable gate array (FPGA) platform consisting of four Xilinx VU19P devices, scaling up to 16 streaming multiprocessors (SMs) and supporting 256 warps. Experimental results demonstrate that Ventus exhibits key performance features comparable to commercial GPGPUs, achieving an average of 83.9% instruction reduction and 87.4% cycle-per-instruction (CPI) improvement over the leading open-source alternatives. Under 4-, 8-, and 16-thread configurations, Ventus maintains robust instructions-per-cycle (IPC) performance with values of 0.47, 0.40, and 0.32, respectively. In addition, the tensor core of Ventus attains an extra average reduction of 69.1% in instruction count and a 68.4% cycle reduction ratio when running AI-related workloads. These findings highlight Ventus as a promising solution for future high-performance GPGPU research and development, offering a robust open-source alternative to proprietary solutions. Ventus is available at https://github.com/THU-DSP-LAB/ventus-gpgpu
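To make the relationship between the two reported metrics concrete: execution cycles are the product of dynamic instruction count and CPI, so an instruction-count reduction and a CPI improvement compound. The sketch below uses made-up placeholder numbers, not figures from the paper.

```python
# Cycle count as instructions x CPI (the classic "iron law" of processor performance);
# the operand values here are hypothetical, not measured Ventus data.
def cycles(instruction_count: int, cpi: float) -> float:
    return instruction_count * cpi

baseline = cycles(1_000_000, 2.0)
optimized = cycles(500_000, 1.0)        # fewer instructions and a lower CPI
print(f"cycle reduction: {1 - optimized / baseline:.0%}")   # 75%
```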
{"title":"RISC-V-Based GPGPU With Vector Capabilities for High-Performance Computing","authors":"Jingzhou Li;Fangfei Yu;Mingyuan Ma;Wei Liu;Yuhan Wang;Hualin Wu;Hu He","doi":"10.1109/TVLSI.2025.3574427","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3574427","url":null,"abstract":"General-purpose graphics processing units (GPGPUs) have become a leading platform for accelerating modern compute-intensive applications, such as large language models and generative artificial intelligence (AI). However, the lack of advanced open-source GPGPU microarchitectures has hindered high-performance research in this area. In this article, we present Ventus, a high-performance open-source GPGPU implementation built upon the RISC-V architecture with vector extension [RISC-V vector (RVV)]. Ventus introduces customized instructions and a comprehensive software toolchain to optimize performance. We deployed the design on a field programmable gate array (FPGA) platform consisting of 4 Xilinx VU19P devices, scaling up to 16 streaming multiprocessors (SMs) and supporting 256 warps. Experimental results demonstrate that Ventus exhibits key performance features comparable to commercial GPGPUs, achieving an average of 83.9% instruction reduction and 87.4% cycle per instruction (CPI) improvement over the leading open-source alternatives. Under 4-, 8-, and 16-thread configurations, Ventus maintains robust instruction per cycle (IPC) performance with values of 0.47, 0.40, and 0.32, respectively. In addition, the tensor core of Ventus attains an extra average reduction of 69.1% in instruction count and a 68.4% cycle reduction ratio when running AI-related workloads. These findings highlight Ventus as a promising solution for future high-performance GPGPU research and development, offering a robust open-source alternative to proprietary solutions. Ventus can be found on <uri>https://github.com/THU-DSP-LAB/ventus-gpgpu</uri>","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 8","pages":"2239-2251"},"PeriodicalIF":2.8,"publicationDate":"2025-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144705279","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-06-20 | DOI: 10.1109/TVLSI.2025.3578959
Chuanjie Chen;Xiangyu Meng;Wang Xie;Baoyong Chi
Delay solutions applied at high frequencies typically involve switched transmission lines or all-pass filters. These solutions often suffer from significant insertion loss and drastic gain variations at high frequencies, along with poor delay flatness. In this work, we design a delay circuit suitable for high frequencies, featuring excellent delay flatness, good delay resolution, and a wide bandwidth. In this design, a multistage cascaded sampling circuit is used to generate delays. By introducing differential clocks or three-phase clocks, simple coarse or fine delays can be achieved. The measurement results show that the sample-and-hold circuit achieves a delay accuracy of 17.5 ps and a delay range of 453 ps within 0.5–2.5 GHz, with a gain of −2.7 to 2 dB and a gain variation of ±0.85 dB, a delay variation of less than 7.5 ps, a power consumption of 111 mW, and a core area of 0.137 mm².
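A behavioral sketch of how a coarse/fine control code could compose a programmable delay; the coarse step size below is an assumption for illustration only, while the 17.5-ps fine step is the resolution reported above.

```python
# Illustrative coarse/fine delay composition (not a model of the measured circuit).
COARSE_STEP_PS = 140.0   # assumed coarse step of the cascaded sampling stages
FINE_STEP_PS = 17.5      # fine resolution reported in the abstract

def programmed_delay(coarse_code: int, fine_code: int) -> float:
    """Total programmed delay in picoseconds for a given coarse/fine control code."""
    return coarse_code * COARSE_STEP_PS + fine_code * FINE_STEP_PS

print(programmed_delay(3, 2))   # 455.0 ps, near the 453-ps range reported
```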
{"title":"A Sample-and-Hold-Based 453-ps True Time Delay Circuit With a Wide Bandwidth of 0.5–2.5 GHz in 65-nm CMOS","authors":"Chuanjie Chen;Xiangyu Meng;Wang Xie;Baoyong Chi","doi":"10.1109/TVLSI.2025.3578959","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3578959","url":null,"abstract":"Delay solutions applied to high frequencies typically involve switched transmission lines or all-pass filters. These solutions often suffer from significant insertion loss and drastic gain variations at high frequencies, along with poor delay flatness. In this work, we have designed a delay circuit that can be applied to high frequencies, featuring excellent delay flatness, good delay resolution, and a wide bandwidth. In this design, a multistage cascaded sampling circuit is used to generate delays. By introducing differential clocks or three-phase clocks, simple coarse delay or fine delay can be achieved. The measurement results show that the sample and hold circuit achieves a delay accuracy of 17.5 ps and a delay range of 453 ps within 0.5–2.5 GHz, with a gain of −2.7 to 2 dB and a gain variation of ±0.85 dB, a delay variation less than 7.5 ps, a power consumption of 111 mW, and a core area of 0.137 mm<sup>2</sup>.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 8","pages":"2344-2348"},"PeriodicalIF":2.8,"publicationDate":"2025-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144704918","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Unstructured pruning introduces significant sparsity in deep neural networks (DNNs), enhancing accelerator hardware efficiency. However, three critical challenges constrain performance gains: 1) complex fetching logic for nonzero (NZ) data pairs; 2) load imbalance across processing elements (PEs); and 3) PE stalls from write-back contention. This brief proposes an energy-efficient accelerator addressing these inefficiencies through three innovations. First, we propose a Cartesian-product output-row-stationary (CPORS) dataflow that inherently matches NZ data pairs by sequentially fetching compressed data. Second, a multilevel partial sum reduction (MLPR) strategy minimizes write-back traffic and converts random PE stalls into manageable load imbalance. Third, a kernel sorting and load scheduling (KSLS) mechanism resolves PE idle/stall conditions and achieves PE array-level load balancing, attaining 76.6% average PE utilization across all sparsity levels. Implemented in 22-nm CMOS, the accelerator delivers a $1.85\times$ speedup and $1.4\times$ higher energy efficiency over the baseline and achieves 25.8 TOPS/W peak energy efficiency at 90% sparsity.
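As a functional illustration of the Cartesian-product matching idea (a software sketch under assumed 1-D convolution semantics, not the accelerator's hardware dataflow), pairing every nonzero activation with every nonzero weight fetched sequentially from the compressed streams removes the need for index-matching fetch logic; each product is simply scattered to the output it contributes to.

```python
from collections import defaultdict

def cartesian_product_conv1d(acts, weights):
    """acts/weights: (index, value) pairs for the nonzero entries of a 1-D signal/kernel."""
    out = defaultdict(float)
    for ai, av in acts:              # sequential fetch from compressed activations
        for wi, wv in weights:       # sequential fetch from compressed weights
            out[ai + wi] += av * wv  # output index for a full 1-D convolution
    return dict(out)

acts = [(0, 2.0), (3, -1.0)]         # dense form: [2, 0, 0, -1]
wts = [(0, 1.0), (1, 0.5)]           # dense form: [1, 0.5]
print(cartesian_product_conv1d(acts, wts))   # {0: 2.0, 1: 1.0, 3: -1.0, 4: -0.5}
```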
{"title":"Accelerating Unstructured Sparse DNNs via Multilevel Partial Sum Reduction and PE Array-Level Load Balancing","authors":"Chendong Xia;Qiang Li;Zhi Li;Bing Li;Huidong Zhao;Shushan Qiao","doi":"10.1109/TVLSI.2025.3577626","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3577626","url":null,"abstract":"Unstructured pruning introduces significant sparsity in deep neural networks (DNNs), enhancing accelerator hardware efficiency. However, three critical challenges constrain performance gains: 1) complex fetching logic for nonzero (NZ) data pairs; 2) load imbalance across processing elements (PEs); and 3) PE stalls from write-back contention. This brief proposes an energy-efficient accelerator addressing these inefficiencies through three innovations. First, we propose a Cartesian-product output-row-stationary (CPORS) dataflow that inherently matches NZ data pairs by sequentially fetching compressed data. Second, a multilevel partial sum reduction (MLPR) strategy minimizes write-back traffic and converts random PE stalls into manageable load imbalance. Third, a kernel sorting and load scheduling (KSLS) mechanism resolves PE idle/stall and achieves PE array-level load balancing, attaining 76.6% average PE utilization across all sparsity levels. Implemented in 22-nm CMOS, the accelerator delivers <inline-formula> <tex-math>$1.85times $ </tex-math></inline-formula> speedup and <inline-formula> <tex-math>$1.4times $ </tex-math></inline-formula> energy efficiency over baseline and achieves 25.8 TOPS/W peak energy efficiency at 90% sparsity.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 8","pages":"2329-2333"},"PeriodicalIF":2.8,"publicationDate":"2025-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144702122","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Resistive random access memory (RRAM)-based in-memory computing (IMC) architectures are currently receiving widespread attention. Since this computing approach relies on the analog characteristics of the devices, the write variation of RRAM can affect the computational accuracy to varying degrees. Conventional write–verify (W&V) procedures are performed on all weight parameters, resulting in significant time overhead. To address this issue, we propose a training algorithm that can recover the offline IMC accuracy impacted by write variation at a lower W&V cost. We introduce an importance-driven weight allocation (IDWA) algorithm during the training process of the neural network. This algorithm constrains the values of less important weights to suppress the diffusion of variation interference on this part of the weights, thus reducing unnecessary accuracy degradation. Additionally, we employ a layer-wise optimization algorithm to identify the important weights in the neural network for W&V operations. Extensive testing across various deep neural network (DNN) architectures and datasets demonstrates that our proposed selective W&V methodology consistently outperforms current state-of-the-art selective W&V techniques in both accuracy preservation and computational efficiency. At the same accuracy levels, it delivers a speed improvement of $6\times$–$32\times$ compared with other advanced methods.
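A simplified sketch of the selective write–verify idea: only the weights judged important in a layer are programmed with W&V, the rest are written once without verification. The importance criterion used here (weight magnitude) and the W&V ratio are assumptions for illustration, not the IDWA criterion itself.

```python
import numpy as np

def select_for_write_verify(layer_weights: np.ndarray, wv_ratio: float = 0.1):
    """Mark the wv_ratio most 'important' weights of a layer for write-verify."""
    flat = np.abs(layer_weights).ravel()
    k = max(1, int(wv_ratio * flat.size))
    threshold = np.sort(flat)[-k]                 # k-th largest magnitude
    return np.abs(layer_weights) >= threshold     # True -> programmed with W&V

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
mask = select_for_write_verify(w, wv_ratio=0.1)
print(f"fraction verified: {mask.mean():.2f}")    # ~0.10
```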
{"title":"IDWA: A Importance-Driven Weight Allocation Algorithm for Low Write–Verify Ratio RRAM-Based In-Memory Computing","authors":"Jingyuan Qu;Debao Wei;Dejun Zhang;Yanlong Zeng;Zhelong Piao;Liyan Qiao","doi":"10.1109/TVLSI.2025.3578388","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3578388","url":null,"abstract":"Resistive random access memory (RRAM)-based in-memory computing (IMC) architectures are currently receiving widespread attention. Since this computing approach relies on the analog characteristics of the devices, the write variation of RRAM can affect the computational accuracy to varying degrees. Conventional write–verify (W&V) procedures are performed on all weight parameters, resulting in significant time overhead. To address this issue, we propose a training algorithm that can recover the offline IMC accuracy impacted by write variation with a lower cost of W&V overhead. We introduce a importance-driven weight allocation (IDWA) algorithm during the training process of the neural network. This algorithm constrains the values of less important weights to suppress the diffusion of variation interference on this part of the weights, thus reducing unnecessary accuracy degradation. Additionally, we employ a layer-wise optimization algorithm to identify important weights in the neural network for W&V operations. Extensive testing across various deep neural networks (DNNs) architectures and datasets demonstrates that our proposed selective W&V methodology consistently outperforms current state-of-the-art selective W&V techniques in both accuracy preservation and computational efficiency. At same accuracy levels, it delivers a speed improvement of <inline-formula> <tex-math>$6times sim 32times $ </tex-math></inline-formula> compared to other advanced methods.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 9","pages":"2508-2517"},"PeriodicalIF":3.1,"publicationDate":"2025-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144904728","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With the rise of artificial intelligence (AI), neural network applications are placing growing demands on efficient data transmission. The traditional von Neumann architecture can no longer keep pace with modern technological needs. Computing-in-memory (CIM) has been proposed as a promising solution to address this bottleneck. This work introduces a local computing cell (LCC) scheme based on compact 6T-SRAM cells. The proposed circuit enhances energy efficiency and reduces power consumption by reusing the LCC. The LCC circuit can perform the multiplication of a 2-bit input with a 1-bit weight, which can be applied to convolutional neural networks (CNNs) with multiply–accumulate (MAC) operations. Through circuit reuse, it can also be used for multibit multiply operations, performing 2-bit input multiplication and 1-bit weight addition, which can be applied to grayscale edge detection in images. The SRAM-CIM macro achieves an energy efficiency of 46.3 TOPS/W for MAC operations with 8-bit input and 8-bit weight precision, and up to 389.1–529.1 TOPS/W for computation within one subarray with 2-bit input and 1-bit weight precision. The estimated inference accuracy on the CIFAR-10 dataset is 90.21%.
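The sketch below illustrates, as an assumption about how such a macro composes precision rather than the paper's exact scheme, how an unsigned 8-bit × 8-bit multiply can be built from the 2-bit-input × 1-bit-weight primitive the LCC evaluates, combined by shift-and-add.

```python
def sliced_multiply(x: int, w: int, x_bits: int = 8, w_bits: int = 8) -> int:
    """Unsigned multiply built from 2-bit input slices times 1-bit weight slices."""
    acc = 0
    for i in range(0, x_bits, 2):                  # 2-bit input slices
        x_slice = (x >> i) & 0b11
        for j in range(w_bits):                    # 1-bit weight slices
            w_slice = (w >> j) & 0b1
            acc += (x_slice * w_slice) << (i + j)  # LCC-style primitive, then shift-add
    return acc

assert sliced_multiply(173, 94) == 173 * 94
```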
{"title":"A 28 nm Dual-Mode SRAM-CIM Macro With Local Computing Cell for CNNs and Grayscale Edge Detection","authors":"Chunyu Peng;Xiaohang Chen;Mengya Gao;Jiating Guo;Lijun Guan;Chenghu Dai;Zhiting Lin;Xiulong Wu","doi":"10.1109/TVLSI.2025.3578319","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3578319","url":null,"abstract":"With the rise of artificial intelligence (AI), neural network applications are growing in demand for efficient data transmission. The traditional von Neumann architecture can no longer keep pace with modern technological needs. Computing-in-memory (CIM) is proposed as a promising solution to address this bottleneck. This work introduces a local computing cell (LCC) scheme based on compact 6T-SRAM cells. The proposed circuit aims to enhance energy efficiency and reduce power consumption by reusing the LCC. The LCC circuit can perform the multiplication of a 2-bit input with a 1-bit weight, which can be applied to convolutional neural networks (CNNs) with the multiply-accumulate (MAC) operations. Through circuit reuse, it can also be used for multibit multiply operations, performing 2-bit input multiplication and 1-bit weight addition, which can be applied to grayscale edge detection in images. The energy efficiency of the SRAM-CIM macro achieves an energy efficiency of 46.3 TOPS/W under MAC operations with input precision of 8-bits and weight precision of 8-bits, and up to 389.1–529.1 TOPS/W under the calculation in one subarray with an input precision of 2-bits and a weight precision of 1-bit. The estimated inference accuracy on CIFAR-10 datasets is 90.21%.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 8","pages":"2264-2273"},"PeriodicalIF":2.8,"publicationDate":"2025-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144705280","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-06-17 | DOI: 10.1109/TVLSI.2025.3576998
Ming Yan;Jaime Cardenas Chavez;Kamal El-Sankary;Li Chen;Xiaotong Lu
This article presents a 10-bit radiation-hardened-by-design (RHBD) SAR analog-to-digital converter (ADC) operating at 50 MS/s, designed for aerospace applications in high-radiation environments. System- and circuit-level redundancy techniques are implemented to mitigate radiation-induced errors and metastability. A novel split coarse/fine asynchronous SAR ADC architecture is proposed to provide system-level redundancy. At the circuit level, single-event effect (SEE) error detection and radiation-hardening techniques are implemented. Our co-designed SEE error detection scheme includes last-bit-cycle (LBC) detection following the LSB cycle and metastability detection (MD) via a ramp generator with a threshold trigger. This approach detects and corrects radiation-induced errors using a coarse/fine redundant algorithm. Radiation-hardened latch comparators and D flip-flops (DFFs) are incorporated to further mitigate SEEs. The prototype design is fabricated in TSMC 65-nm technology, with an ADC core area of 0.0875 mm² and a power consumption of 2.79 mW at a 1.2-V power supply. Postirradiation tests confirm functionality up to a 100-krad(Si) total ionizing dose (TID) and demonstrate over 90% suppression of large SEEs under laser testing.
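For context, a plain (non-hardened) SAR conversion is a bitwise binary search; the behavioral model below shows only that baseline loop and does not model the paper's coarse/fine split, redundancy, or SEE detection.

```python
def sar_convert(vin: float, vref: float = 1.2, n_bits: int = 10) -> int:
    """Ideal successive-approximation conversion: one comparator decision per bit cycle."""
    code = 0
    for bit in reversed(range(n_bits)):
        trial = code | (1 << bit)
        dac_level = trial / (1 << n_bits) * vref
        if vin >= dac_level:          # keep the trial bit if the input exceeds the DAC level
            code = trial
    return code

print(sar_convert(0.45))              # 384 for a 0.45-V input with a 1.2-V reference
```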
{"title":"A 10-bit 50-MS/s Radiation Tolerant Split Coarse/Fine SAR ADC in 65-nm CMOS","authors":"Ming Yan;Jaime Cardenas Chavez;Kamal El-Sankary;Li Chen;Xiaotong Lu","doi":"10.1109/TVLSI.2025.3576998","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3576998","url":null,"abstract":"This article presents a 10-bit radiation-hardened-by-design (RHBD) SAR analog-to-digital converter (ADC) operating at 50 MS/s, designed for aerospace applications in high-radiation environments. The system- and circuit-level redundancy techniques are implemented to mitigate radiation-induced errors and metastability. A novel split coarse/fine asynchronous SAR ADC architecture is proposed to provide system-level redundancy. At circuits level, single-event effects (SEEs) error detection and radiation-hardened techniques are implemented. Our co-designed SEE error detection scheme includes last-bit-cycle (LBC) detection following the LSB cycle and metastability detection (MD) via a ramp generator with a threshold trigger. This approach detects and corrects radiation-induced errors using a coarse/fine redundant algorithm. The radiation-hardened latch comparators and D flip-flops (DFFs) are incorporated to further mitigate SEEs. The prototype design is fabricated using TSMC 65-nm technology, with an ADC core area of 0.0875 mm<sup>2</sup> and a power consumption of 2.79 mW at a 1.2-V power supply. Postirradiation tests confirm functionality up to 100-krad(Si) total ionizing dose (TID) and demonstrate over 90% suppression of large SEE under laser testing.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 8","pages":"2132-2142"},"PeriodicalIF":2.8,"publicationDate":"2025-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144704917","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The big data era has facilitated various memory-centric algorithms, such as the Transformer decoder, neural networks, stochastic computing (SC), and genetic sequence matching, which impose high demands on memory capacity, bandwidth, and access power consumption. Emerging nonvolatile memory devices and the compute-near-memory (CNM) architecture offer a promising solution for memory-bound tasks. This work proposes a hybrid resistive random access memory (RRAM) and static random access memory (SRAM) CNM architecture. The main contributions include: 1) proposing an energy-efficient and high-density CNM architecture based on the hybrid integration of RRAM and SRAM arrays; 2) designing low-power CNM circuits using logic gates and a dynamic-logic adder with a configurable datapath; and 3) proposing a broadcast mechanism with an output-stationary workflow to reduce memory access. The proposed RRAM-SRAM CNM architecture and the dataflow tailored for four distinct applications are evaluated at a 28-nm technology node, achieving 4.62-TOPS/W energy efficiency and 1.20-Mb/mm² memory density, which represent $11.35\times$–$25.81\times$ and $1.44\times$–$4.92\times$ improvements over previous works, respectively.
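A dataflow-level sketch (software only, with function names of my own choosing) of the broadcast, output-stationary workflow: each fetched operand is broadcast across a row of accumulators, so every element is read from memory once while partial sums stay local.

```python
import numpy as np

def output_stationary_matmul(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N))              # partial sums stay resident ("output-stationary")
    for k in range(K):
        a_col = A[:, k]               # fetched once ...
        b_row = B[k, :]               # ... and broadcast to all accumulators
        C += np.outer(a_col, b_row)   # rank-1 update of every output element
    return C

A = np.arange(6, dtype=float).reshape(2, 3)
B = np.arange(12, dtype=float).reshape(3, 4)
assert np.allclose(output_stationary_matmul(A, B), A @ B)
```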
{"title":"A High-Density Energy-Efficient CNM Macro Using Hybrid RRAM and SRAM for Memory-Bound Applications","authors":"Jun Wang;Shengzhe Yan;Xiangqu Fu;Zhihang Qian;Zhi Li;Zeyu Guo;Zhuoyu Dai;Zhaori Cong;Chunmeng Dou;Feng Zhang;Jinshan Yue;Dashan Shang","doi":"10.1109/TVLSI.2025.3576889","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3576889","url":null,"abstract":"The big data era has facilitated various memory-centric algorithms, such as the Transformer decoder, neural network, stochastic computing (SC), and genetic sequence matching, which impose high demands on memory capacity, bandwidth, and access power consumption. The emerging nonvolatile memory devices and compute-near-memory (CNM) architecture offer a promising solution for memory-bound tasks. This work proposes a hybrid resistive random access memory (RRAM) and static random access memory (SRAM) CNM architecture. The main contributions include: 1) proposing an energy-efficient and high-density CNM architecture based on the hybrid integration of RRAM and SRAM arrays; 2) designing low-power CNM circuits using the logic gates and dynamic-logic adder with configurable datapath; and 3) proposing a broadcast mechanism with output-stationary workflow to reduce memory access. The proposed RRAM-SRAM CNM architecture and dataflow tailored for four distinct applications are evaluated at a 28-nm technology, achieving 4.62-TOPS<inline-formula> <tex-math>$/$ </tex-math></inline-formula>W energy efficiency and 1.20-Mb<inline-formula> <tex-math>$/$ </tex-math></inline-formula>mm<sup>2</sup> memory density, which shows <inline-formula> <tex-math>$11.35times $ </tex-math></inline-formula>–<inline-formula> <tex-math>$25.81times $ </tex-math></inline-formula> and <inline-formula> <tex-math>$1.44times $ </tex-math></inline-formula>–<inline-formula> <tex-math>$4.92times $ </tex-math></inline-formula> improvement compared to previous works, respectively.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 8","pages":"2339-2343"},"PeriodicalIF":2.8,"publicationDate":"2025-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144704903","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
March algorithms are essential for detecting functional memory faults, characterized by their linear complexity and adaptability to emerging technologies. However, the increasing complexity of fault types presents significant challenges to existing fault detection models regarding analytical efficiency and adaptability. This article introduces the test primitive (TP), a unified notation that characterizes March test sequences through a novel methodology that decouples fault detection operations from sensitization states. The proposed TP achieves platform independence and seamless integration of fault models, supported by rigorous theoretical proofs. These proofs establish the fundamental properties of the TP in terms of completeness, uniqueness, and conciseness, providing a theoretical foundation that ensures the decoupling method reduces the computational complexity of March algorithm analysis to $O(1)$. This reduction is analogous to Karnaugh map simplification in digital logic while enabling millisecond-level automated analysis. Experimental results demonstrate that the proposed method significantly enhances both analyzable fault coverage (FC) and detection accuracy, thereby addressing critical limitations of existing fault detection models.
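For readers unfamiliar with March notation, the sketch below models the classic March C- sequence in software (a fault-free memory model using conventional element notation, not the paper's TP notation): each element applies its read/write operations at every address in ascending or descending order.

```python
def march_c_minus(mem):
    """Run March C- on a list-backed memory model; raises if any read mismatches."""
    n = len(mem)

    def element(order, ops):
        for addr in order:
            for op, val in ops:
                if op == "w":
                    mem[addr] = val
                elif mem[addr] != val:                  # read with expected value
                    raise AssertionError(f"fault detected at address {addr}")

    element(range(n), [("w", 0)])                       # any order (w0)
    element(range(n), [("r", 0), ("w", 1)])             # ascending (r0, w1)
    element(range(n), [("r", 1), ("w", 0)])             # ascending (r1, w0)
    element(reversed(range(n)), [("r", 0), ("w", 1)])   # descending (r0, w1)
    element(reversed(range(n)), [("r", 1), ("w", 0)])   # descending (r1, w0)
    element(range(n), [("r", 0)])                       # any order (r0)

march_c_minus([0] * 16)                                 # completes silently on a fault-free memory
```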
{"title":"Test Primitives: The Unified Notation for Characterizing March Test Sequences","authors":"Ruiqi Zhu;Houjun Wang;Susong Yang;Weikun Xie;Yindong Xiao","doi":"10.1109/TVLSI.2025.3577448","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3577448","url":null,"abstract":"March algorithms are essential for detecting functional memory faults, characterized by their linear complexity and adaptability to emerging technologies. However, the increasing complexity of fault types presents significant challenges to existing fault detection models regarding analytical efficiency and adaptability. This article introduces the test primitive (TP), a unified notation that characterizes March test sequences through a novel methodology that decouples fault detection operations from sensitization states. The proposed TP achieves platform independence and seamless integration of fault models, supported by rigorous theoretical proofs. These proofs establish the fundamental properties of the TP in terms of completeness, uniqueness, and conciseness, providing a theoretical foundation that ensures the decoupling method reduces the computational complexity of March algorithm analysis to <inline-formula> <tex-math>$O(1)$ </tex-math></inline-formula>. This reduction is analogous to Karnaugh map simplification in digital logic while enabling millisecond-level automated analysis. Experimental results demonstrate that the proposed method significantly enhances both analyzable fault coverage (FC) and detection accuracy, thereby addressing critical limitations of existing fault detection models.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 9","pages":"2542-2555"},"PeriodicalIF":3.1,"publicationDate":"2025-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144904755","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Design space exploration (DSE) is crucial for optimizing the performance, power, and area (PPA) of CPU microarchitectures ($\mu$-archs). While various machine learning (ML) algorithms have been applied to the $\mu$-arch DSE problem, the potential of reinforcement learning (RL) remains underexplored. In this article, we propose a novel RL-based approach to address the reduced instruction set computer V (RISC-V) CPU $\mu$-arch DSE problem. This approach enables dynamic selection and optimization of $\mu$-arch parameters without relying on predefined modification sequences, thus significantly enhancing exploration flexibility. To address the challenges posed by high-dimensional action spaces and sparse rewards, we use a discrete soft actor-critic (SAC) framework with entropy maximization to promote efficient exploration. In addition, we integrate multistep temporal-difference (TD) learning, an experience replay (ER) buffer, and return normalization to improve sample efficiency and learning stability during training. Our method further aligns optimization with user-defined preferences by normalizing PPA metrics relative to baseline designs. Experimental results on the Berkeley out-of-order machine (BOOM) demonstrate that the proposed approach achieves superior performance compared with state-of-the-art methods, showcasing its effectiveness and efficiency for $\mu$-arch DSE. Our code is available at https://github.com/exhaust-create/SAC-DSE.
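A sketch of the baseline-normalized reward shaping idea described above; the preference weights and the exact form of the normalization are assumptions for illustration, not the paper's definition.

```python
def ppa_reward(perf, power, area, baseline, prefs=(0.5, 0.3, 0.2)):
    """Higher is better; each PPA metric is normalized against the baseline design."""
    perf0, power0, area0 = baseline
    w_perf, w_power, w_area = prefs       # user-defined preference weights (assumed)
    return (w_perf * (perf / perf0)       # higher performance is better
            + w_power * (power0 / power)  # lower power is better
            + w_area * (area0 / area))    # lower area is better

baseline = (1.0, 1.0, 1.0)
print(ppa_reward(1.2, 0.9, 1.1, baseline))   # ~1.12, i.e., better than the baseline's 1.0
```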
{"title":"Efficient Design Space Exploration for the BOOM Using SAC-Based Reinforcement Learning","authors":"Mingjun Cheng;Shihan Zhang;Xin Zheng;Xian Lin;Huaien Gao;Shuting Cai;Xiaoming Xiong;Bei Yu","doi":"10.1109/TVLSI.2025.3572799","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3572799","url":null,"abstract":"Design space exploration (DSE) is crucial for optimizing the performance, power, and area (PPA) of CPU microarchitectures (<inline-formula> <tex-math>$mu $ </tex-math></inline-formula>-archs). While various machine learning (ML) algorithms have been applied to the <inline-formula> <tex-math>$mu $ </tex-math></inline-formula>-arch DSE problem, the potential of reinforcement learning (RL) remains underexplored. In this article, we propose a novel RL-based approach to address the reduced instruction set computer V (RISC-V) CPU <inline-formula> <tex-math>$mu $ </tex-math></inline-formula>-arch DSE problem. This approach enables dynamic selection and optimization of <inline-formula> <tex-math>$mu $ </tex-math></inline-formula>-arch parameters without relying on predefined modification sequences, thus significantly enhancing exploration flexibility. To address the challenges posed by high-dimensional action spaces and sparse rewards, we use a discrete soft actor-critic (SAC) framework with entropy maximization to promote efficient exploration. In addition, we integrate multistep temporal-difference (TD) learning, an experience replay (ER) buffer, and return normalization to improve sample efficiency and learning stability during training. Our method further aligns optimization with user-defined preferences by normalizing PPA metrics relative to baseline designs. Experimental results on the Berkeley out-of-order machine (BOOM) demonstrate that the proposed approach achieves superior performance compared with state-of-the-art methods, showcasing its effectiveness and efficiency for <inline-formula> <tex-math>$mu $ </tex-math></inline-formula>-arch DSE. Our code is available at <uri>https://github.com/exhaust-create/SAC-DSE</uri>.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 8","pages":"2252-2263"},"PeriodicalIF":2.8,"publicationDate":"2025-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144702123","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}