IEEE Transactions on Very Large Scale Integration (VLSI) Systems: Latest Articles, Page 10

Test Primitives: The Unified Notation for Characterizing March Test Sequences

IF 3.1 Zone 2, Engineering & Technology Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Transactions on Very Large Scale Integration (VLSI) Systems

Pub Date : 2025-06-17 DOI: 10.1109/TVLSI.2025.3577448

Ruiqi Zhu;Houjun Wang;Susong Yang;Weikun Xie;Yindong Xiao

March algorithms, characterized by their linear complexity and adaptability to emerging technologies, are essential for detecting functional memory faults. However, the increasing complexity of fault types presents significant challenges to existing fault detection models regarding analytical efficiency and adaptability. This article introduces the test primitive (TP), a unified notation that characterizes March test sequences through a novel methodology that decouples fault detection operations from sensitization states. The proposed TP achieves platform independence and seamless integration of fault models, supported by rigorous theoretical proofs. These proofs establish the fundamental properties of the TP in terms of completeness, uniqueness, and conciseness, providing a theoretical foundation that ensures the decoupling method reduces the computational complexity of March algorithm analysis to $O(1)$. This reduction is analogous to Karnaugh map simplification in digital logic while enabling millisecond-level automated analysis. Experimental results demonstrate that the proposed method significantly enhances both analyzable fault coverage (FC) and detection accuracy, thereby addressing critical limitations of existing fault detection models.
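For readers unfamiliar with March tests, the sketch below shows what a March test sequence looks like and why its cost is linear in memory size. It runs the well-known March C- test on a toy one-bit-per-cell memory. The `Memory` class, the single stuck-at fault model, and the function names are illustrative assumptions; this is not the TP notation or the analysis method proposed in the article.

```python
# March C-: {any(w0); up(r0,w1); up(r1,w0); down(r0,w1); down(r1,w0); any(r0)}
# Each element is (address order, operations applied to every cell in that order).
MARCH_C_MINUS = [
    ("up", ["w0"]),
    ("up", ["r0", "w1"]),
    ("up", ["r1", "w0"]),
    ("down", ["r0", "w1"]),
    ("down", ["r1", "w0"]),
    ("up", ["r0"]),
]


class Memory:
    """Toy one-bit-per-cell memory with an optional stuck-at fault (hypothetical model)."""

    def __init__(self, size, stuck_at=None):
        self.cells = [0] * size
        self.stuck_at = stuck_at  # (address, value) or None

    def write(self, addr, value):
        self.cells[addr] = value

    def read(self, addr):
        # A stuck-at cell always returns its stuck value, whatever was written.
        if self.stuck_at is not None and self.stuck_at[0] == addr:
            return self.stuck_at[1]
        return self.cells[addr]


def run_march(mem, size, test=MARCH_C_MINUS):
    """Apply a March test; return (list of failing (addr, op) reads, operation count)."""
    failures = []
    ops = 0
    for order, element in test:
        addrs = range(size) if order == "up" else range(size - 1, -1, -1)
        for addr in addrs:
            for op in element:
                kind, bit = op[0], int(op[1])
                ops += 1
                if kind == "w":
                    mem.write(addr, bit)
                elif mem.read(addr) != bit:
                    failures.append((addr, op))
    return failures, ops
```

March C- applies 10 operations per cell, so the operation count returned for an N-cell memory is 10N, which is the linear complexity the abstract refers to.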


Vol. 33, No. 9, pp. 2542-2555.

Citations: 0

A 66-Gb/s/5.5-W RISC-V Many-Core Cluster for 5G+ Software-Defined Radio Uplinks

IF 2.8 Zone 2, Engineering & Technology Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Transactions on Very Large Scale Integration (VLSI) Systems

Pub Date : 2025-06-17 DOI: 10.1109/TVLSI.2025.3576855

Marco Bertuletti;Yichao Zhang;Alessandro Vanelli-Coralli;Luca Benini

Following the scale-up of new radio (NR) complexity in 5G and beyond, the physical layer’s computing load on base stations is increasing under a strictly constrained latency and power budget; base stations must process $>$20-Gb/s uplink wireless data rate on the fly, in $<$10 W. At the same time, the programmability and reconfigurability of base station components are key requirements: they reduce the time and cost of deploying new networks, lower the acceptance threshold for industry players to enter the market, and ensure return on investments in a fast-paced evolution of standards. In this article, we present the design of a many-core cluster for 5G and beyond base station processing. Our design features 1024 streamlined RISC-V cores with domain-specific FP extensions and 4-MiB shared memory. It provides the necessary computational capabilities for software-defined processing of the lower physical layer of 5G physical uplink shared channel (PUSCH), satisfying high-end throughput requirements (66 Gb/s for a transmission time interval (TTI), 9.4–302 Gb/s depending on the processing stage). The throughput metrics for the implemented functions are ten times higher than in state-of-the-art (SoTA) application-specific instruction processors (ASIPs). The energy efficiency on key NR kernels (2–41 Gb/s/W), measured at 800 MHz, $25~^{\circ}$C, and 0.8 V, on a placed and routed instance in 12-nm CMOS technology, is competitive with SoTA architectures. The PUSCH processing runs end-to-end on a single cluster in 1.7 ms, at <6-W average power consumption, achieving 12 Gb/s/W.
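As a quick consistency check on the headline figures, the 12 Gb/s/W efficiency matches the TTI throughput divided by the power given in the title. This assumes the 5.5 W in the title is the average power of the end-to-end PUSCH run, which the abstract only bounds as <6 W:

```latex
\[
\frac{66~\text{Gb/s}}{5.5~\text{W}} = 12~\text{Gb/s/W}
\]
```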


Vol. 33, No. 8, pp. 2225-2238.

Citations: 0

SC-IMC: Algorithm-Architecture Co-Optimized SRAM-Based In-Memory Computing for Sine/Cosine and Convolutional Acceleration

IF 2.8 Zone 2, Engineering & Technology Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Transactions on Very Large Scale Integration (VLSI) Systems

Pub Date : 2025-06-11 DOI: 10.1109/TVLSI.2025.3573753

Qi Cao;Shang Wang;Haisheng Fu;Qifan Gao;Zhenjiao Chen;Li Gao;Feng Liang

Sine/cosine (SC) is widely used in practical engineering applications, such as image compression and motor control. Nevertheless, due to power sensitivity and speed demands, SC acceleration suffers from limitations in traditional von Neumann architectures. To overcome this challenge, we propose accelerating SC and convolution using a static random access memory (SRAM)-based in-memory computing (IMC) architecture through algorithm-architecture co-optimization. We develop the first SC algorithm that transforms nonlinear operations into the IMC paradigm, enabling the IMC array to handle both SC and artificial intelligence (AI) tasks and making the IMC array a reusable module. Our architecture extends the computing functions of a macro dedicated to convolutional neural networks (CNNs), with less than a 1% area increase. The proposed SC algorithm for FP32 data achieves high accuracy, within 1 unit in the last place (ulp) of error compared with the C math library. Moreover, we build an intelligent IMC system that supports various CNNs. Our IMC macro implements 512-kb binary weight storage within 3.0366-mm² area in SMIC 28-nm technology and presents area/energy efficiency of 2160.29–270.04 GOPS/mm² and 513.95–8.03 TOPS/W in CNN mode. The proposed algorithm and architecture facilitate the integration of more nonlinear functions into IMC with minimal area overhead.
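To make the ulp accuracy criterion concrete, the sketch below measures how many FP32 representable values separate two results. The helper names and the degree-9 Taylor polynomial are illustrative assumptions standing in for an approximate sine; they are not the SC algorithm of the article, and Python's `math.sin` is used only as a stand-in reference.

```python
import math
import struct


def f32_key(x: float) -> int:
    """Round x to FP32 and map its bit pattern onto a monotonically ordered integer."""
    (u,) = struct.unpack("<I", struct.pack("<f", x))
    # Sign-magnitude patterns: negative values are mirrored below zero.
    return u if u < 0x80000000 else 0x80000000 - u


def ulp_distance_f32(a: float, b: float) -> int:
    """Number of FP32 steps between a and b after rounding both to FP32."""
    return abs(f32_key(a) - f32_key(b))


def sin_taylor(x: float) -> float:
    """Degree-9 Taylor polynomial of sin(x); illustrative only, accurate near 0."""
    return x - x**3 / 6 + x**5 / 120 - x**7 / 5040 + x**9 / 362880


# Distance between the reference and the approximation at one sample point.
distance = ulp_distance_f32(math.sin(0.5), sin_taylor(0.5))
```

A result of 0 means both values round to the same FP32 number, and 1 means they are adjacent FP32 numbers.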


Vol. 33, No. 8, pp. 2200-2213.

Citations: 0

A Fourth-Order Tunable Bandwidth Gm-C Filter for ECG Detection Achieving −7.9 dBV IIP3 Under a 0.5 V Supply

IF 3.1 Zone 2, Engineering & Technology Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Transactions on Very Large Scale Integration (VLSI) Systems

Pub Date : 2025-06-11 DOI: 10.1109/TVLSI.2025.3576360

Farzan Rezaei;Loai G. Salem

This article introduces a fourth-order $G_{m}$-C low-pass filter for ECG detection that achieves high linearity despite operating under a 0.5 V supply by 1) placing the differential pairs (DPs) of the employed $G_{m}$ stages in a two-loop feedback structure, 2) employing body-driven rather than gate-driven $G_{m}$ DPs, and 3) using current mirrors in place of cascoded transistors in a conventional $G_{m}$ stage. Measurement results of a $0.18~\mu$m CMOS prototype show that the proposed filter, operating with a $V_{\text{DD}}$ of 0.5 V, achieves a third-order harmonic distortion (HD3) below −40 dB for input amplitudes up to 340 mV$_{\text{pp}}$. With an integrated noise of $154.7~\mu$V$_{\text{rms}}$ over a 240-Hz bandwidth, the filter exhibits a dynamic range (DR) of 53.6 dB, which is competitive with previously reported works.


Vol. 33, No. 9, pp. 2438-2448.

Citations: 0

Architectures for Serial and Parallel Pipelined NTT-Based Polynomial Modular Multiplication

IF 3.1 Zone 2, Engineering & Technology Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Transactions on Very Large Scale Integration (VLSI) Systems

Pub Date : 2025-06-11 DOI: 10.1109/TVLSI.2025.3576782

Sin-Wei Chiu;Keshab K. Parhi

Quantum computers pose a significant threat to modern cryptographic systems by efficiently solving problems such as integer factorization through Shor’s algorithm. Homomorphic encryption (HE) schemes based on ring learning with errors (Ring-LWE) offer a quantum-resistant framework for secure computations on encrypted data. Many of these schemes rely on polynomial multiplication, which can be efficiently accelerated using the number theoretic transform (NTT) in leveled HE, ensuring practical performance for privacy-preserving applications. This article presents a novel NTT-based serial pipelined multiplier that achieves full-hardware utilization through interleaved folding, and overcomes the 50% under-utilization limitation of the conventional serial R2MDC architecture. In addition, it explores tradeoffs in pipelined parallel designs, including serial, 2-parallel, and 4-parallel architectures. Our designs leverage increased parallelism, efficient folding techniques, and optimizations for a selected constant modulus to achieve superior throughput (TP) compared with state-of-the-art implementations. While the serial fold design minimizes area consumption, the 4-parallel design maximizes TP. Experimental results on the Virtex-7 platform demonstrate that our architectures achieve at least 2.22 times higher TP/area for a polynomial length of 1024 and 1.84 times for a polynomial length of 4096 in the serial fold design, while the 4-parallel design achieves at least 2.78 times and 2.79 times, respectively. The efficiency gain is even more pronounced in TP squared over area, where the serial fold and 4-parallel designs outperform prior works by at least 4.98 times and 26.43 times for a polynomial length of 1024 and 6.7 times and 43.77 times for a polynomial length of 4096, respectively. These results highlight the effectiveness of our architectures in balancing performance, area efficiency, and flexibility, making them well-suited for high-speed cryptographic applications.
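The following software sketch shows the arithmetic that NTT-based polynomial modular multiplication performs, that is, multiplication in $\mathbb{Z}_q[x]/(x^n+1)$. The toy parameters (n = 8, q = 17, psi = 3 as a primitive 2n-th root of unity) and all function names are assumptions chosen for illustration. The recursive radix-2 transform here says nothing about the pipelined, folded hardware architectures of the article.

```python
def ntt(a, omega, q):
    """Recursive radix-2 NTT of a (length a power of two) with n-th root omega mod q."""
    n = len(a)
    if n == 1:
        return list(a)
    omega_sq = omega * omega % q
    even = ntt(a[0::2], omega_sq, q)
    odd = ntt(a[1::2], omega_sq, q)
    out = [0] * n
    w = 1
    for k in range(n // 2):
        t = w * odd[k] % q
        out[k] = (even[k] + t) % q           # butterfly: upper output
        out[k + n // 2] = (even[k] - t) % q  # butterfly: lower output
        w = w * omega % q
    return out


def negacyclic_mul(a, b, psi=3, q=17):
    """Multiply a and b modulo (x^n + 1, q) via weighted NTTs."""
    n = len(a)
    omega = psi * psi % q
    # Pre-weight by powers of psi so cyclic convolution becomes negacyclic.
    a_w = [a[i] * pow(psi, i, q) % q for i in range(n)]
    b_w = [b[i] * pow(psi, i, q) % q for i in range(n)]
    prod = [x * y % q for x, y in zip(ntt(a_w, omega, q), ntt(b_w, omega, q))]
    # Inverse NTT: forward transform with omega^-1, then scale by n^-1.
    c_w = ntt(prod, pow(omega, -1, q), q)
    n_inv = pow(n, -1, q)
    psi_inv = pow(psi, -1, q)
    return [c_w[i] * n_inv * pow(psi_inv, i, q) % q for i in range(n)]


def schoolbook_negacyclic(a, b, q=17):
    """O(n^2) reference: terms that wrap past x^n pick up a minus sign."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            if i + j < n:
                c[i + j] = (c[i + j] + a[i] * b[j]) % q
            else:
                c[i + j - n] = (c[i + j - n] - a[i] * b[j]) % q
    return c
```

For example, multiplying x by x^7 gives x^8, which reduces to -1 modulo x^8 + 1, so the result has q - 1 in the constant coefficient and zeros elsewhere.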


Vol. 33, No. 9, pp. 2474-2487.

Citations: 0

A Soft Iterative Receiver With Simplified EP Detection for Coded MIMO Systems

IF 2.8 Zone 2, Engineering & Technology Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Transactions on Very Large Scale Integration (VLSI) Systems

Pub Date : 2025-06-09 DOI: 10.1109/TVLSI.2025.3536019

Xiaosi Tan;Xiaohua Xie;Houren Ji;Tiancan Xia;Yongming Huang;Xiaohu You;Chuan Zhang

Expectation propagation (EP) achieves excellent performance with high-order modulation in massive multiple-input multiple-output (MIMO) detection. The soft output of the EP detector can be iteratively combined with turbo soft decoders to enhance error-correction performance. However, the implementation of EP-based iterative detection and decoding (IDD) receivers suffers from an exponential increase in computational complexity as the number of antennas and modulation order grows. In this brief, we propose a simplified EP approximation-based IDD (sEPA-IDD) scheme for hardware implementation. To alleviate the computational burden, a simplified message update scheme is proposed, reducing complexity by 68% without performance degradation. Additionally, a unified design for extrinsic message computation further improves hardware utilization. Finally, we introduce the first unfolded EP-based IDD architecture to boost throughput. Compared with state-of-the-art (SOA) IDD receivers, the sEPA-IDD receiver implemented on 65 nm CMOS delivers a throughput of 3.07 Gb/s with a maximum 0.5 dB gain, achieving $4.03\times$ higher throughput and $6.04\times$ greater area efficiency.


Vol. 33, No. 7, pp. 1994-1998.

Citations: 0

RHT_NoC: A Reconfigurable Hybrid Topology Architecture for Chiplet-Based Multicore System

IF 2.8 Zone 2, Engineering & Technology Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Transactions on Very Large Scale Integration (VLSI) Systems

Pub Date : 2025-06-05 DOI: 10.1109/TVLSI.2025.3572112

Dongyu Xu;Wu Zhou;Zhengfeng Huang;Huaguo Liang;Xiaoqing Wen

Chiplet-based system-on-chip (SoC) architectures, leveraging 2.5-D/3-D integration technologies, provide scalable solutions for a wide range of applications. Achieving high performance and cost-effectiveness in these systems relies heavily on optimizing die-to-die interconnect topologies and designs, which are essential for seamless interchiplet communication. This article introduces a reconfigurable hybrid topology (RHT) architecture designed for chiplet-based multicore systems. RHT achieves high performance and energy efficiency by dynamically reconfiguring the network topology in response to traffic variations, adaptively selecting transport subnets, and optimizing link bandwidth allocation, thereby minimizing congestion and maximizing packet throughput. Furthermore, RHT leverages global traffic information to dynamically combine Torus loops, maximizing opportunities for rapid packet delivery while guaranteeing minimal hop counts. Moreover, RHT accelerates packet transmission via bufferless combined loops, which extends the continuous sleeping periods of routers, improves power gating efficiency, and significantly reduces static power consumption. Simulation results indicate that the Mesh-DyRing achieves over a 40% reduction in network latency and more than a 20% decrease in power consumption overhead compared to the baseline design. When compared to WiNoC, an advanced hybrid wired-wireless topology design, the Mesh-DyRing-PG configuration reduces power consumption by 56.2% while maintaining equivalent average network latency.
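To illustrate why Torus loops can shorten paths relative to a plain mesh, the sketch below compares minimal hop counts between two routers on a 2-D grid. The coordinate convention and function names are assumptions for illustration; this is textbook mesh/torus distance, not the RHT routing or reconfiguration policy.

```python
def mesh_hops(src, dst):
    """Minimal hops on a 2-D mesh: Manhattan distance between (x, y) routers."""
    return abs(dst[0] - src[0]) + abs(dst[1] - src[1])


def torus_hops(src, dst, width, height):
    """Minimal hops on a 2-D torus: each dimension may wrap around its ring."""
    dx = abs(dst[0] - src[0])
    dy = abs(dst[1] - src[1])
    return min(dx, width - dx) + min(dy, height - dy)


# Corner-to-corner traffic on a 4x4 grid benefits most from wraparound links.
corner_mesh = mesh_hops((0, 0), (3, 3))
corner_torus = torus_hops((0, 0), (3, 3), 4, 4)
```

On a 4x4 grid the corner-to-corner path takes 6 hops on the mesh but 2 hops on the torus, since each dimension can wrap in a single hop.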


Vol. 33, No. 8, pp. 2104-2117.

Citations: 0

A Scalable FPGA Architecture With Adaptive Memory Utilization for GEMM-Based Operations

IF 2.8 Zone 2 (Engineering & Technology) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Transactions on Very Large Scale Integration (VLSI) Systems

Pub Date : 2025-06-03 DOI: 10.1109/TVLSI.2025.3571677

Anastasios Petropoulos;Theodore Antonakopoulos

Deep neural network (DNN) inference relies increasingly on specialized hardware for high computational efficiency. This work introduces a field-programmable gate array (FPGA)-based dynamically configurable accelerator featuring systolic arrays (SAs), high-bandwidth memory (HBM), and UltraRAMs. We present two processing unit (PU) configurations with different computing capabilities using the same interfaces and peripheral blocks. By instantiating multiple PUs and employing a heuristic weight transfer schedule, the architecture achieves notable throughput efficiency over prior works. Moreover, we outline how the architecture can be extended to emulate analog in-memory computing (AIMC) devices to aid next-generation heterogeneous AIMC chip designs and investigate device-level noise behavior. Overall, this brief presents a versatile DNN inference acceleration architecture adaptable to various models and future FPGA designs.

Citations: 0

Design of a Low-Power Analog Integrated Deep Convolutional Neural Network

IF 2.8 Zone 2 (Engineering & Technology) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Transactions on Very Large Scale Integration (VLSI) Systems

Pub Date : 2025-06-03 DOI: 10.1109/TVLSI.2025.3573045

Zisis Foufas;Vassilis Alimisis;Paul P. Sotiriadis

In this article, a framework for the analog implementation of a deep convolutional neural network (CNN) is introduced and used to derive a new circuit architecture which is composed of an improved analog multiplier and circuit blocks implementing the ReLU activation function and the argmax operator. The operating principles of the individual blocks, as well as those of the complete architecture, are analyzed and used to realize a low-power analog classifier, consuming less than $1.8~\mu\text{W}$. The proper operation of the classifier is verified via a comparison with a software equivalent implementation and its performance is evaluated against existing circuit architectures. The proposed architecture is implemented in a TSMC 90-nm CMOS process and simulated using Cadence IC Suite for both schematic and layout design. Corner and Monte Carlo mismatch simulations of the schematic and the physical circuit (postlayout) were conducted to evaluate the effect of transistor mismatches and process voltage temperature (PVT) variations and to showcase a proposed systematic method for offsetting their effect.
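The abstract names three building blocks (a multiplier, a ReLU block, and an argmax operator) and mentions a software-equivalent implementation used for verification. As a rough illustration only, the Python sketch below chains those three operations in a toy classifier. The layer structure (one kernel per class, a global sum as the class score) and all function names are assumptions made for this example and are not taken from the article.

```python
def conv2d_valid(image, kernel):
    """2-D 'valid' cross-correlation on nested lists (the multiply-accumulate step)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu(feature_map):
    """Element-wise ReLU: negative activations are clamped to zero."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def argmax(scores):
    """Index of the largest score (the first one wins on ties)."""
    return max(range(len(scores)), key=lambda i: scores[i])


def classify(image, kernels):
    """Hypothetical toy classifier: one kernel per class, ReLU, global sum, argmax."""
    scores = [sum(map(sum, relu(conv2d_valid(image, k)))) for k in kernels]
    return argmax(scores), scores


if __name__ == "__main__":
    image = [[1, 0], [0, 1]]
    kernels = [[[1, 0], [0, 1]], [[0, 1], [1, 0]], [[-1, 0], [0, -1]]]
    print(classify(image, kernels))
```

A software reference of this kind gives the expected class index for each input, which can then be compared against the output of a circuit simulation.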

Citations: 0

High-Speed Compute-Efficient Bandit Learning for Many Arms

IF 2.8 Zone 2 (Engineering & Technology) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Transactions on Very Large Scale Integration (VLSI) Systems

Pub Date : 2025-06-03 DOI: 10.1109/TVLSI.2025.3573924

Ishaan Sharma;Sumit J. Darak;Rohit Kumar

Multiarmed bandits (MABs) are online machine learning algorithms that aim to identify the optimal arm without prior statistical knowledge via the exploration-exploitation tradeoff. The performance metric, regret, and computational complexity of the MAB algorithms degrade with the increase in the number of arms, K. In applications such as wireless communication, radar systems, and sensor networks, K, i.e., the number of antennas, beams, bands, etc., is expected to be large. In this work, we consider focused exploration-based MAB, which outperforms conventional MAB for large K, and its mapping on various edge processors and multiprocessor system on a chip (MPSoC) via hardware-software co-design (HSCD) and fixed point (FP) analysis. The proposed architecture offers 67% reduction in average cumulative regret, 84% reduction in execution time on edge processor, 97% reduction in execution time using FPGA-based accelerator, and 10% savings in resources over state-of-the-art MABs for large $K=100$.
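To make the exploration-exploitation tradeoff and the regret metric concrete, the sketch below runs the standard UCB1 algorithm on Bernoulli arms and accumulates pseudo-regret. UCB1 stands in here for a conventional MAB; it is not the focused exploration-based algorithm of the article, and the function name, arm means, and horizon are assumptions chosen for illustration.

```python
import math
import random


def run_ucb1(means, horizon, seed=0):
    """Play UCB1 on Bernoulli arms; return (pull counts, cumulative pseudo-regret)."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k   # number of pulls per arm
    sums = [0.0] * k   # total observed reward per arm
    best = max(means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            # Initialisation: pull every arm once, which already costs K rounds.
            arm = t - 1
        else:
            # Empirical mean (exploitation) plus confidence bonus (exploration).
            arm = max(
                range(k),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]  # pseudo-regret is computed from the true means
    return counts, regret


if __name__ == "__main__":
    counts, regret = run_ucb1([0.2, 0.5, 0.8], horizon=2000)
    print(counts, round(regret, 1))
```

Both the initialisation phase and the per-round search over arms scale with K in this sketch, which is consistent with the abstract's remark that regret and computational cost grow with the number of arms.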

Citations: 0