Pub Date: 2025-06-02, DOI: 10.1109/TVLSI.2025.3572517
Jiaxin Qing;Philip H. W. Leong;Kin-Hong Lee;Raymond W. Yeung
Network coding enhances performance in network communications and distributed storage by increasing throughput and robustness while reducing latency. Batched sparse (BATS) codes are a class of capacity-achieving network codes, but their practical application is hindered by their structure and by the computational intensity and power demands of finite field (FF) operations. Most of the literature focuses on algorithmic-level techniques to improve coding efficiency, while optimization through algorithm/hardware co-design has long been neglected. Leveraging the unique structure of BATS codes, we first present cyclic-shift BATS (CS-BATS), a hardware-friendly variant. Next, we propose a simple but effective bounded-value (BV) generator that reduces the size of a finite field multiplier by up to 70%. Finally, we report on a scalable and resource-efficient field-programmable gate array (FPGA)-based network coding accelerator that achieves a throughput of 27 Gb/s, a speedup of more than 300× over software.
Title: Toward High-Performance Network Coding: FPGA Acceleration With Bounded-Value Generators. Journal: IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 8, pp. 2274-2287. URL: https://doi.org/10.1109/TVLSI.2025.3572517
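As a generic illustration of the finite field (FF) operations this abstract refers to, the sketch below multiplies bytes in GF(2^8) and forms one coded packet as a linear combination of source packets, which is the basic operation behind linear network coding. The field size, the reduction polynomial `0x11B`, and the names `gf256_mul` and `encode_packet` are assumptions chosen for illustration; the abstract does not specify them, and this is not the CS-BATS or BV-generator design.

```python
def gf256_mul(a: int, b: int, poly: int = 0x11B) -> int:
    """Multiply two bytes in GF(2^8) using shift-and-add with reduction.

    The polynomial 0x11B (x^8 + x^4 + x^3 + x + 1) is an assumed choice.
    """
    result = 0
    for _ in range(8):
        if b & 1:            # add (XOR) the current multiple of a
            result ^= a
        b >>= 1
        a <<= 1              # multiply a by x
        if a & 0x100:        # reduce when the degree reaches 8
            a ^= poly
    return result


def encode_packet(packets, coeffs):
    """Return sum_i coeffs[i] * packets[i] over GF(2^8), byte by byte."""
    length = len(packets[0])
    coded = [0] * length
    for pkt, c in zip(packets, coeffs):
        for j in range(length):
            coded[j] ^= gf256_mul(c, pkt[j])   # addition in GF(2^8) is XOR
    return bytes(coded)
```

Each coded byte costs one field multiplication per source packet, which is why the size and power of the multiplier dominate a hardware implementation.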
A compact power-on-reset (POR) circuit with a configurable brown-out reset (BOR) function is presented. An integrated voltage reference (VR) circuit provides a constant bias voltage that facilitates voltage-triggered POR/BOR operation, reliably preventing POR signal generation when the ramping supply voltage ($V_{\text{DD}}$) level is too low. Moreover, the proposed POR circuit features a fast, configurable POR/BOR operation owing to an inverter-based trip point detector (TPD), which triggers the reset signal with a programmable trip point. The prototype POR circuit achieves a POR level higher than 752 mV with a maximum POR delay of $16.4~\mu$s at a 0.8–1.2-V $V_{\text{DD}}$, supporting a wide range of supply ramping time from $1~\mu$s to 1 s. In addition, the prototype detects brown-out events with a supply drop of 0.1–0.4 V, generating the BOR signal. Designed using a 28-nm CMOS process, the prototype has a compact active area of $995.3~\mu\text{m}^2$ and a quiescent current of 162–974 nA at a 1-V $V_{\text{DD}}$.
Title: A Compact Power-on-Reset Circuit With Configurable Brown-Out Detection. Authors: Yoochang Kim; Jun-Eun Park; Kwanseo Park; Young-Ha Hwang. Journal: IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 7, pp. 2074-2078. Pub Date: 2025-04-30. URL: https://doi.org/10.1109/TVLSI.2025.3561131
The Z-transform is a fundamental and powerful tool that is widely used in signal processing and in other applications such as communications and networking. By analyzing the Z-transform of a signal, one can extract critical information about its stability, causality, frequency response, energy and power, and overall behavior. However, errors caused either by environmental changes or by malicious injections in very-large-scale integration (VLSI) implementations can critically compromise the integrity and reliability of its output. Failure to detect such faults may result in unpredictable, erroneous, and misleading function analyses. Therefore, the ability to detect soft errors and faults before accepting the results is of paramount importance. In this article, we propose an efficient fault detection method that combines algorithmic-level checks with partial recomputation to identify both transient and permanent faults with a high error coverage rate across various injection scenarios. The AMD/Xilinx field-programmable gate array (FPGA) implementation of our design demonstrated only a modest increase in time and area overhead. To the best of our knowledge, fault detection for the Z-transform function has not been previously studied.
Title: Efficient Partial Recomputation-Based Fault Detection Approaches for Z-transform. Authors: Saeed Aghapour; Kasra Ahmadi; Mehran Mozaffari Kermani; Reza Azarderakhsh. Journal: IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 7, pp. 1983-1993. Pub Date: 2025-04-30. URL: https://doi.org/10.1109/TVLSI.2025.3560154
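For a finite-length sequence, the Z-transform is $X(z) = \sum_{n=0}^{N-1} x[n] z^{-n}$. The sketch below evaluates it in two independent ways (a direct sum and Horner's rule in $z^{-1}$) and rejects the result if they disagree. This is plain full recomputation, shown only to illustrate the idea of checking a result before accepting it; it is simpler than, and should not be read as, the partial-recomputation scheme of the article. The function names and the tolerance `tol` are hypothetical.

```python
def z_transform_direct(x, z):
    """Evaluate X(z) = sum_n x[n] * z**(-n) term by term."""
    return sum(xn * z ** (-n) for n, xn in enumerate(x))


def z_transform_horner(x, z):
    """Evaluate the same polynomial in w = 1/z with Horner's rule."""
    w = 1 / z
    acc = 0
    for xn in reversed(x):
        acc = acc * w + xn
    return acc


def checked_z_transform(x, z, tol=1e-9):
    """Return X(z) only if two independent evaluations agree within tol."""
    primary = z_transform_direct(x, z)
    check = z_transform_horner(x, z)
    if abs(primary - check) > tol * max(1.0, abs(primary)):
        raise ArithmeticError("mismatch between redundant evaluations")
    return primary
```

For example, `checked_z_transform([1, 2, 3], 2)` evaluates 1 + 2/2 + 3/4. A fault that corrupts only one of the two evaluation paths shows up as a mismatch.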
Pub Date: 2025-04-29, DOI: 10.1109/TVLSI.2025.3562015
Guan-Rong Chen;Kuen-Jong Lee
Integrated circuits (ICs) have become extremely complex nowadays. Therefore, multiple test standards could be employed to handle different testing scenarios. Unfortunately, this also leads to serious security problems since attackers can exploit the excellent controllability and observability of test standards to steal confidential information or disrupt the circuit’s functionality. This article proposes a universal sequential authentication scheme that is compatible with test standards employing the test access port controller (TAPC) defined in IEEE Std 1149.1. The main objective is to protect multiple TAPC-based test standards with a universal security module. In this scheme, only authorized test data can be updated to the target register to control the corresponding test standard, and only the response to authorized test data can be output. The key idea is to generate different authentication keys for different test data, and even with the same set of test data, if their input sequences are different, their authentication keys will also be different. Furthermore, we develop an irreversible obfuscation mechanism to generate fake output data to confuse attackers. Due to its irreversibility, the original correct output data cannot be deduced from the fake output data. Experimental results on a typical processor, i.e., SCR1, show that the proposed scheme causes no time overhead, and the area overhead is only 1.74%.
Title: A Universal Sequential Authentication Scheme for TAPC-Based Test Standards. Journal: IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 7, pp. 1972-1982. URL: https://doi.org/10.1109/TVLSI.2025.3562015
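The property that the same set of test data yields different authentication keys when applied in a different order can be illustrated with a chained keyed hash. The sketch below is a software analogy under stated assumptions: the use of HMAC-SHA-256, the all-zero initial state, and the name `sequence_key` are hypothetical and are not taken from the article, whose hardware key generator is not described in the abstract.

```python
import hashlib
import hmac


def sequence_key(secret: bytes, test_data) -> bytes:
    """Derive a key that depends on both the test data and their order.

    Each step folds the previous state into the next digest, so swapping
    two items changes every later state and therefore the final key.
    """
    state = b"\x00" * 32  # assumed initial state
    for item in test_data:
        state = hmac.new(secret, state + item, hashlib.sha256).digest()
    return state
```

Calling `sequence_key(secret, [a, b])` and `sequence_key(secret, [b, a])` with distinct `a` and `b` produces different 32-byte keys, while repeating the same sequence reproduces the same key.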
In fifth-generation (5G) communication systems, multiple input multiple output (MIMO) and orthogonal frequency-division multiplexing (OFDM) are two critical technologies. The fast Fourier transform (FFT), as the core processing step of OFDM, directly affects the overall system performance. In this brief, we propose a novel block-level pipelined architecture, which divides the FFT processor into three pipeline blocks: input, radix, and output. Each pipeline block can work on a different FFT simultaneously to achieve higher throughput. Specifically, to reduce the OFDM system-level latency of 5G applications, the FFT processor supports weighted overlap and add (WOLA) on the cyclic prefix and suffix of OFDM symbols. This architecture is implemented using TSMC 12-nm technology, with a processor die area of 0.89 mm$^2$ and a power consumption of 568 mW at 1 GHz. The FFT processor can achieve a system-level throughput of up to 2.66 GS/s.
Title: A Novel High-Throughput FFT Processor With a Block-Level Pipeline for 5G MIMO OFDM Systems. Authors: Meiyu Liu; Zhijun Wang; Hanqing Luo; Shengnan Lin; Liping Liang. Journal: IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 7, pp. 2059-2063. Pub Date: 2025-04-28. URL: https://doi.org/10.1109/TVLSI.2025.3558947
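The block-level pipeline in this abstract (input, radix, and output blocks each working on a different FFT at the same time) can be pictured with a simple occupancy table. The sketch below assumes, purely for illustration, that every block takes one equal-length time slot per FFT; the abstract gives no timing details, and `pipeline_schedule` is a hypothetical helper rather than a model of the actual processor.

```python
STAGES = ("input", "radix", "output")


def pipeline_schedule(num_ffts):
    """Return, per time slot, which FFT index each block works on.

    A value of None means the block is idle in that slot (pipeline
    fill or drain).
    """
    slots = []
    for t in range(num_ffts + len(STAGES) - 1):
        slot = {}
        for s, name in enumerate(STAGES):
            idx = t - s  # block s lags the input block by s slots
            slot[name] = idx if 0 <= idx < num_ffts else None
        slots.append(slot)
    return slots
```

Under this assumption, once the pipeline is full, one FFT completes per slot instead of one per three slots, which is the source of the throughput gain.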
This article proposes a fully differential ten-bit energy-efficient successive approximation register (SAR) analog-to-digital converter (ADC) for wearable 12-lead electrocardiogram (ECG) acquisition systems. The proposed ADC structure generates two bypass windows through a capacitor-splitting technique, which can skip unnecessary quantization steps. The judgment module of the bypass windows requires only an XOR gate. By introducing redundant capacitors to participate in quantization, the total capacitance value is reduced by half. The proposed SAR ADC is fabricated using a standard 180-nm CMOS process. The measurement results show that it can achieve an effective number of bits (ENOB) of 9.38 bits and a spurious-free dynamic range (SFDR) of 76.71 dB with a supply voltage of 0.6 V at a sampling rate ($F_{\mathrm{S}}$) of 6.94 kS/s. The power consumption is 15.61 nW when subjected to a 1.17-$V_{\mathrm{PP}}$ 3.45-kHz sinusoidal input, resulting in a figure of merit (FoM) of 3.38 fJ/conv.-step. The average power consumption for quantizing 12-lead ECG signals is approximately 12.66 nW, demonstrating the ability to achieve ultralow-power quantization of ECG signals.
Title: A 0.6-V 9.38-Bit 6.9-kS/s Capacitor-Splitting Bypass Window SAR ADC for Wearable 12-Lead ECG Acquisition Systems. Authors: Kangkang Sun; Jingjing Liu; Feng Yan; Yuan Ren; Ruihuang Wu; Bingjun Xiong; Zhipeng Li; Jian Guan. Journal: IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 7, pp. 1838-1847. Pub Date: 2025-04-28. URL: https://doi.org/10.1109/TVLSI.2025.3559669
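The reported figure of merit can be cross-checked from the other numbers in the abstract. The sketch below assumes the commonly used Walden definition, FoM = P / (2^ENOB × F_S); the abstract does not name the definition, but the reported values are consistent with it. The function name `walden_fom` is hypothetical.

```python
def walden_fom(power_w, enob_bits, fs_hz):
    """Walden figure of merit in joules per conversion step."""
    return power_w / (2 ** enob_bits * fs_hz)


# Values taken from the abstract: 15.61 nW, 9.38 bits, 6.94 kS/s.
fom = walden_fom(15.61e-9, 9.38, 6.94e3)
print(f"{fom * 1e15:.2f} fJ/conv.-step")  # prints 3.38 fJ/conv.-step
```

The result reproduces the 3.38 fJ/conv.-step stated in the abstract.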
Pub Date: 2025-04-28, DOI: 10.1109/TVLSI.2025.3561507
Kunyao Lai;Enyi Yao;Zhenxing Li;Yongkui Yang
Embedded DRAM (eDRAM) has been widely adopted as on-chip cache memory in modern processors due to its high density. In this article, we propose a 2T gain-cell eDRAM-based macro that functions not only as traditional cache memory but also as an in-memory computing unit capable of performing logic operations. Furthermore, this eDRAM macro features in situ storing, completely eliminating the need for external memory or register access during computation. The sense amplifier in this macro is equipped with a programmable voltage reference, enabling support for various Boolean logic operations, including AND/NAND, OR/NOR, and NOT. In addition, the macro integrates a transmission-gate (TG)-based shifter cluster to perform data shifting, which is commonly required in general computations. To enhance functionality, we design an instruction set that supports compound logic computations, allowing Boolean logic, shifting, and in situ storage to be executed within a single instruction. We validated this eDRAM macro in a 32-kb bitcell array using the 40-nm logic CMOS technology. Compared with state-of-the-art designs, our macro achieves a relatively high density of 729.2 kb/mm$^2$ and a competitive logic energy of 14.1 fJ/bit.
Title: A High-Density eDRAM Macro With Programmable Sense Amplifier and TG-Shifter for Logical-Instruction-Based In-Memory Computing. Journal: IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 7, pp. 2069-2073. URL: https://doi.org/10.1109/TVLSI.2025.3561507
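One common way a programmable sense-amplifier reference can select a Boolean operation is by thresholding the combined read level of two cells: a high reference yields AND, a low reference yields OR, and inverting the output yields NAND/NOR. The sketch below is an idealized behavioral model of that general technique. It assumes the read level is simply the number of cells storing 1 and that references sit at 0.5 and 1.5; the abstract does not describe the actual circuit mechanism, so all names and values here are hypothetical.

```python
def sense(cells, reference, invert=False):
    """Idealized sense amplifier: compare the summed cell level to a reference."""
    level = sum(cells)  # assumed bitline level: count of cells storing 1
    out = 1 if level > reference else 0
    return out ^ 1 if invert else out


# Assumed (reference, invert) setting for each two-input operation.
OPS = {
    "and": (1.5, False),
    "nand": (1.5, True),
    "or": (0.5, False),
    "nor": (0.5, True),
}


def logic(op, a, b):
    """Evaluate a two-input operation by selecting reference and polarity."""
    reference, invert = OPS[op]
    return sense((a, b), reference, invert)


def logic_not(a):
    """NOT as a single-cell read with inverted output."""
    return sense((a,), 0.5, invert=True)
```

In this model the same read path computes every operation, and only the reference level and output polarity change between them.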
Pub Date: 2025-04-25, DOI: 10.1109/TVLSI.2025.3559403
Abdul Rahoof;Vivek Chaturvedi;Mahesh Raveendranatha Panicker;Muhammad Shafique
In recent years, there has been a growing trend in accelerating computationally complex nonreal-time beamforming algorithms in ultrasound imaging using deep learning models. However, due to the large size and complexity, these state-of-the-art deep learning techniques pose significant challenges when deploying on resource-constrained edge devices. In this work, we propose a novel capsule network-based beamformer called CapsBeam, designed to operate on raw radio frequency data and provide an envelope of beamformed data through nonsteered plane-wave insonification. In experiments on in vivo data, CapsBeam reduced artifacts compared to the standard Delay-and-Sum (DAS) beamforming. For in vitro data, CapsBeam demonstrated a 32.31% increase in contrast, along with gains of 16.54% and 6.7% in axial and lateral resolution compared to the DAS. Similarly, in silico data showed a 26% enhancement in contrast, along with improvements of 13.6% and 21.5% in axial and lateral resolution, respectively, compared to the DAS. To reduce the parameter redundancy and enhance the computational efficiency, we pruned the model using our multilayer look-ahead kernel pruning (LAKP-ML) methodology, achieving a compression ratio of 85% without affecting the image quality. Additionally, the hardware complexity of the proposed model is reduced by applying quantization, simplification of nonlinear operations, and parallelizing operations. Finally, we proposed a specialized accelerator architecture for the pruned and optimized CapsBeam model, implemented on a Xilinx ZU7EV FPGA. The proposed accelerator achieved a throughput of 30 GOPS for the convolution operation and 17.4 GOPS for the dynamic routing operation.
Title: CapsBeam: Accelerating Capsule Network-Based Beamformer for Ultrasound Nonsteered Plane-Wave Imaging on Field-Programmable Gate Array. Journal: IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 7, pp. 1934-1944. URL: https://doi.org/10.1109/TVLSI.2025.3559403
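For context, the Delay-and-Sum (DAS) baseline that CapsBeam is compared against can be sketched for a single pixel under nonsteered plane-wave transmission: the transmit path to a pixel at depth z is z, the receive path to an element at lateral position x_i is the straight-line distance back, and the samples at the corresponding delays are summed across channels. The sketch is a minimal textbook form, not the CapsBeam model. The sound speed `c`, sampling rate `fs`, nearest-sample lookup, and absence of apodization are assumptions.

```python
import math


def das_pixel(rf, element_x, x, z, c=1540.0, fs=40e6):
    """Delay-and-sum value for the pixel at lateral position x and depth z.

    rf[i][k] is sample k of receive channel i; element_x[i] is the lateral
    position of element i in metres. c (m/s) and fs (Hz) are assumed values.
    """
    total = 0.0
    for channel, xi in zip(rf, element_x):
        # Plane-wave transmit path (z) plus receive path back to element i.
        t = (z + math.hypot(z, x - xi)) / c
        k = round(t * fs)  # nearest-sample lookup, no interpolation
        if 0 <= k < len(channel):
            total += channel[k]
    return total
```

Repeating this for every pixel gives the beamformed image; the per-pixel, per-channel delay computation is what makes conventional beamforming costly and motivates learned alternatives.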
Pub Date: 2025-04-25, DOI: 10.1109/TVLSI.2025.3557605
Title: IEEE Transactions on Very Large Scale Integration (VLSI) Systems Society Information. Journal: IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 5, p. C3. URL: https://doi.org/10.1109/TVLSI.2025.3557605. Open-access PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10977653
Pub Date: 2025-04-25, DOI: 10.1109/TVLSI.2025.3557603
Title: IEEE Transactions on Very Large Scale Integration (VLSI) Systems Publication Information. Journal: IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 5, p. C2. URL: https://doi.org/10.1109/TVLSI.2025.3557603. Open-access PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10977654