Pub Date : 2025-06-17 DOI: 10.1109/TVLSI.2025.3576855
Marco Bertuletti;Yichao Zhang;Alessandro Vanelli-Coralli;Luca Benini
Following the scale-up of new radio (NR) complexity in 5G and beyond, the physical layer’s computing load on base stations is increasing under a strictly constrained latency and power budget: base stations must process a >20-Gb/s uplink wireless data rate on the fly, in <10 W. At the same time, programmability and reconfigurability of base station components are key requirements: they reduce the time and cost of deploying new networks, lower the entry threshold for industry players, and ensure return on investment amid a fast-paced evolution of standards. In this article, we present the design of a many-core cluster for 5G-and-beyond base station processing. Our design features 1024 streamlined RISC-V cores with domain-specific FP extensions and 4 MiB of shared memory. It provides the computational capability needed for software-defined processing of the lower physical layer of the 5G physical uplink shared channel (PUSCH), satisfying high-end throughput requirements (66 Gb/s for a transmission time interval (TTI), 9.4–302 Gb/s depending on the processing stage). The throughput of the implemented functions is ten times higher than in state-of-the-art (SoTA) application-specific instruction processors (ASIPs). The energy efficiency on key NR kernels (2–41 Gb/s/W), measured at 800 MHz, 25 °C, and 0.8 V on a placed-and-routed instance in 12-nm CMOS technology, is competitive with SoTA architectures. The PUSCH processing runs end-to-end on a single cluster in 1.7 ms, at <6-W average power consumption, achieving 12 Gb/s/W.
A 66-Gb/s/5.5-W RISC-V Many-Core Cluster for 5G+ Software-Defined Radio Uplinks
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 8, pp. 2225–2238
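The headline numbers above are mutually consistent: the 66-Gb/s PUSCH throughput at the 5.5-W power point from the article title gives exactly the reported 12-Gb/s/W efficiency (the abstract separately reports <6-W average power). A quick cross-check:

```python
# Cross-check of the headline figures from the title and abstract.
throughput_gbps = 66.0   # PUSCH throughput per TTI
power_w = 5.5            # power point from the article title
print(throughput_gbps / power_w)  # 12.0 Gb/s/W, the reported efficiency
```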
Sine/cosine (SC) is widely used in practical engineering applications, such as image compression and motor control. Nevertheless, due to power sensitivity and speed demands, SC acceleration suffers from limitations in traditional von Neumann architectures. To overcome this challenge, we propose accelerating SC and convolution using a static random access memory (SRAM)-based in-memory computing (IMC) architecture through algorithm-architecture co-optimization. We develop the first SC algorithm that transforms nonlinear operations into the IMC paradigm, enabling the IMC array to handle both SC and artificial intelligence (AI) tasks and making it a reusable module. Our architecture extends the computing functions of a macro dedicated to convolutional neural networks (CNNs) with less than a 1% area increase. The proposed SC algorithm for FP32 data achieves high accuracy, within a 1-unit-in-the-last-place (ulp) error margin compared with the C math library. Moreover, we build an intelligent IMC system that supports various CNNs. Our IMC macro implements 512-kb binary weight storage within a 3.0366-mm² area in SMIC 28-nm technology and delivers area/energy efficiency of 2160.29–270.04 GOPS/mm² and 513.95–8.03 TOPS/W in CNN mode. The proposed algorithm and architecture facilitate the integration of more nonlinear functions into IMC with minimal area overhead.
SC-IMC: Algorithm-Architecture Co-Optimized SRAM-Based In-Memory Computing for Sine/Cosine and Convolutional Acceleration
Qi Cao;Shang Wang;Haisheng Fu;Qifan Gao;Zhenjiao Chen;Li Gao;Feng Liang
Pub Date : 2025-06-11 DOI: 10.1109/TVLSI.2025.3573753
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 8, pp. 2200–2213
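The 1-ulp figure above is the standard accuracy metric for elementary-function implementations. As an illustration only (this is not the paper’s IMC algorithm, and the helper names are ours), the following sketch measures the FP32 ulp error of a polynomial sine approximation against Python’s math library:

```python
import math
import struct

def f32(x: float) -> float:
    """Round a double to the nearest IEEE-754 binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def ulp_f32(x: float) -> float:
    """Size of one unit in the last place (ulp) at the binary32 value |x|."""
    bits = struct.unpack("<I", struct.pack("<f", abs(x)))[0]
    return struct.unpack("<f", struct.pack("<I", bits + 1))[0] - f32(abs(x))

def ulp_error(approx: float, exact: float) -> float:
    """FP32 error of `approx` relative to `exact`, measured in ulps."""
    return abs(f32(approx) - exact) / ulp_f32(exact)

# Hypothetical example: a degree-7 Taylor polynomial for sin(x) at x = 0.5
x = 0.5
poly = x - x**3 / 6 + x**5 / 120 - x**7 / 5040
print(ulp_error(poly, math.sin(x)) < 1.0)  # True: within a 1-ulp margin
```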
Pub Date : 2025-06-11 DOI: 10.1109/TVLSI.2025.3576360
Farzan Rezaei;Loai G. Salem
This article introduces a fourth-order Gm-C low-pass filter for ECG detection that achieves high linearity despite operating under a 0.5-V supply by 1) placing the differential pairs (DPs) of the employed Gm stages in a two-loop feedback structure; 2) employing body-driven rather than gate-driven Gm DPs; and 3) using current mirrors in place of cascoded transistors in a conventional Gm stage. Measurement results of a 0.18-µm CMOS prototype show that the proposed filter, operating with a VDD of 0.5 V, achieves a third-order harmonic distortion (HD3) below −40 dB for input amplitudes up to 340 mVpp. With an integrated noise of 154.7 µVrms over a 240-Hz bandwidth, the filter exhibits a dynamic range (DR) of 53.6 dB, which is competitive with previously reported works.
A Fourth-Order Tunable Bandwidth Gm-C Filter for ECG Detection Achieving −7.9 dBV IIP3 Under a 0.5 V Supply
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 9, pp. 2438–2448
Pub Date : 2025-06-11 DOI: 10.1109/TVLSI.2025.3576782
Sin-Wei Chiu;Keshab K. Parhi
Quantum computers pose a significant threat to modern cryptographic systems by efficiently solving problems such as integer factorization through Shor’s algorithm. Homomorphic encryption (HE) schemes based on ring learning with errors (Ring-LWE) offer a quantum-resistant framework for secure computations on encrypted data. Many of these schemes rely on polynomial multiplication, which can be efficiently accelerated using the number theoretic transform (NTT) in leveled HE, ensuring practical performance for privacy-preserving applications. This article presents a novel NTT-based serial pipelined multiplier that achieves full-hardware utilization through interleaved folding, and overcomes the 50% under-utilization limitation of the conventional serial R2MDC architecture. In addition, it explores tradeoffs in pipelined parallel designs, including serial, 2-parallel, and 4-parallel architectures. Our designs leverage increased parallelism, efficient folding techniques, and optimizations for a selected constant modulus to achieve superior throughput (TP) compared with state-of-the-art implementations. While the serial fold design minimizes area consumption, the 4-parallel design maximizes TP. Experimental results on the Virtex-7 platform demonstrate that our architectures achieve at least 2.22 times higher TP/area for a polynomial length of 1024 and 1.84 times for a polynomial length of 4096 in the serial fold design, while the 4-parallel design achieves at least 2.78 times and 2.79 times, respectively. The efficiency gain is even more pronounced in TP squared over area, where the serial fold and 4-parallel designs outperform prior works by at least 4.98 times and 26.43 times for a polynomial length of 1024 and 6.7 times and 43.77 times for a polynomial length of 4096, respectively. These results highlight the effectiveness of our architectures in balancing performance, area efficiency, and flexibility, making them well-suited for high-speed cryptographic applications.
Architectures for Serial and Parallel Pipelined NTT-Based Polynomial Modular Multiplication
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 9, pp. 2474–2487
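For readers unfamiliar with the transform these architectures accelerate, here is a minimal software sketch of NTT-based cyclic polynomial multiplication over Z_p. It uses the common NTT-friendly prime 998244353 with primitive root 3 (an illustrative choice, not the paper’s constant modulus) and a simple recursive radix-2 transform rather than the pipelined R2MDC-style hardware dataflow:

```python
# Minimal sketch of polynomial multiplication via the number theoretic
# transform (NTT). P is an NTT-friendly prime (2^23 divides P - 1) and G
# a primitive root mod P; both are illustrative assumptions.
P, G = 998244353, 3

def ntt(a, invert=False):
    """Recursive radix-2 NTT of a, len(a) a power of two dividing 2^23."""
    n = len(a)
    if n == 1:
        return a[:]
    even, odd = ntt(a[0::2], invert), ntt(a[1::2], invert)
    w_n = pow(G, (P - 1) // n, P)          # principal n-th root of unity
    if invert:
        w_n = pow(w_n, P - 2, P)           # inverse root for inverse NTT
    out, w = [0] * n, 1
    for k in range(n // 2):                # butterfly combination step
        t = w * odd[k] % P
        out[k] = (even[k] + t) % P
        out[k + n // 2] = (even[k] - t) % P
        w = w * w_n % P
    return out

def poly_mul_cyclic(a, b):
    """Length-n cyclic convolution of a and b (i.e., product mod x^n - 1)."""
    n = len(a)
    fc = [x * y % P for x, y in zip(ntt(a), ntt(b))]   # pointwise product
    n_inv = pow(n, P - 2, P)                           # scale by n^-1
    return [x * n_inv % P for x in ntt(fc, invert=True)]

# (1 + 2x)(3 + 4x) mod (x^4 - 1) = 3 + 10x + 8x^2
print(poly_mul_cyclic([1, 2, 0, 0], [3, 4, 0, 0]))  # [3, 10, 8, 0]
```

The hardware tradeoffs discussed above (serial fold vs. 2-/4-parallel) concern how these butterfly stages are scheduled onto multipliers, not the arithmetic itself.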
Expectation propagation (EP) achieves excellent performance with high-order modulation in massive multiple-input multiple-output (MIMO) detection. The soft output of the EP detector can be iteratively combined with turbo soft decoders to enhance error-correction performance. However, the implementation of EP-based iterative detection and decoding (IDD) receivers suffers from an exponential increase in computational complexity as the number of antennas and the modulation order grow. In this brief, we propose a simplified EP approximation-based IDD (sEPA-IDD) scheme for hardware implementation. To alleviate the computational burden, a simplified message update scheme is proposed, reducing complexity by 68% without performance degradation. Additionally, a unified design for extrinsic message computation further improves hardware utilization. Finally, we introduce the first unfolded EP-based IDD architecture to boost throughput. Compared with state-of-the-art (SOA) IDD receivers, the sEPA-IDD receiver implemented in 65-nm CMOS delivers a throughput of 3.07 Gb/s with a maximum 0.5-dB gain, achieving 4.03× higher throughput and 6.04× greater area efficiency.
A Soft Iterative Receiver With Simplified EP Detection for Coded MIMO Systems
Xiaosi Tan;Xiaohua Xie;Houren Ji;Tiancan Xia;Yongming Huang;Xiaohu You;Chuan Zhang
Pub Date : 2025-06-09 DOI: 10.1109/TVLSI.2025.3536019
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 7, pp. 1994–1998
Chiplet-based system-on-chip (SoC) architectures, leveraging 2.5-D/3-D integration technologies, provide scalable solutions for a wide range of applications. Achieving high performance and cost-effectiveness in these systems relies heavily on optimizing die-to-die interconnect topologies and designs, which are essential for seamless interchiplet communication. This article introduces a reconfigurable hybrid topology (RHT) architecture designed for chiplet-based multicore systems. RHT achieves high performance and energy efficiency by dynamically reconfiguring the network topology in response to traffic variations, adaptively selecting transport subnets, and optimizing link bandwidth allocation, thereby minimizing congestion and maximizing packet throughput. Furthermore, RHT leverages global traffic information to dynamically combine Torus loops, maximizing opportunities for rapid packet delivery while guaranteeing minimal hop counts. Moreover, RHT accelerates packet transmission via bufferless combined loops, extending the continuous sleeping periods of routers, improving power-gating efficiency, and significantly reducing static power consumption. Simulation results indicate that Mesh-DyRing achieves over a 40% reduction in network latency and more than a 20% decrease in power consumption compared to the baseline design. When compared to WiNoC, an advanced hybrid wired-wireless topology design, the Mesh-DyRing-PG configuration reduces power consumption by 56.2% while maintaining equivalent average network latency.
RHT_NoC: A Reconfigurable Hybrid Topology Architecture for Chiplet-Based Multicore System
Dongyu Xu;Wu Zhou;Zhengfeng Huang;Huaguo Liang;Xiaoqing Wen
Pub Date : 2025-06-05 DOI: 10.1109/TVLSI.2025.3572112
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 8, pp. 2104–2117
Pub Date : 2025-06-03 DOI: 10.1109/TVLSI.2025.3571677
Anastasios Petropoulos;Theodore Antonakopoulos
Deep neural network (DNN) inference relies increasingly on specialized hardware for high computational efficiency. This work introduces a field-programmable gate array (FPGA)-based dynamically configurable accelerator featuring systolic arrays (SAs), high-bandwidth memory (HBM), and UltraRAMs. We present two processing unit (PU) configurations with different computing capabilities using the same interfaces and peripheral blocks. By instantiating multiple PUs and employing a heuristic weight transfer schedule, the architecture achieves notable throughput efficiency over prior works. Moreover, we outline how the architecture can be extended to emulate analog in-memory computing (AIMC) devices to aid next-generation heterogeneous AIMC chip designs and investigate device-level noise behavior. Overall, this brief presents a versatile DNN inference acceleration architecture adaptable to various models and future FPGA designs.
A Scalable FPGA Architecture With Adaptive Memory Utilization for GEMM-Based Operations
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 8, pp. 2334–2338
Pub Date : 2025-06-03 DOI: 10.1109/TVLSI.2025.3573045
Zisis Foufas;Vassilis Alimisis;Paul P. Sotiriadis
In this article, a framework for the analog implementation of a deep convolutional neural network (CNN) is introduced and used to derive a new circuit architecture composed of an improved analog multiplier and circuit blocks implementing the ReLU activation function and the argmax operator. The operating principles of the individual blocks, as well as those of the complete architecture, are analyzed and used to realize a low-power analog classifier consuming less than 1.8 µW. The proper operation of the classifier is verified via comparison with an equivalent software implementation, and its performance is evaluated against existing circuit architectures. The proposed architecture is implemented in a TSMC 90-nm CMOS process and simulated using the Cadence IC Suite for both schematic and layout design.
Corner and Monte Carlo mismatch simulations of the schematic and the physical circuit (postlayout) were conducted to evaluate the effect of transistor mismatches and process-voltage-temperature (PVT) variations and to showcase a proposed systematic method for offsetting their effect.
Design of a Low-Power Analog Integrated Deep Convolutional Neural Network
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 8, pp. 2172–2185
Pub Date : 2025-06-03 DOI: 10.1109/TVLSI.2025.3573924
Ishaan Sharma;Sumit J. Darak;Rohit Kumar
Multiarmed bandits (MABs) are online machine learning algorithms that aim to identify the optimal arm without prior statistical knowledge via the exploration-exploitation tradeoff. The performance metric, regret, and the computational complexity of MAB algorithms degrade as the number of arms, K, increases. In applications such as wireless communication, radar systems, and sensor networks, K, i.e., the number of antennas, beams, bands, etc., is expected to be large. In this work, we consider focused exploration-based MAB, which outperforms conventional MAB for large K, and its mapping on various edge processors and a multiprocessor system-on-chip (MPSoC) via hardware-software co-design (HSCD) and fixed-point (FP) analysis. The proposed architecture offers a 67% reduction in average cumulative regret, an 84% reduction in execution time on an edge processor, a 97% reduction in execution time using an FPGA-based accelerator, and 10% savings in resources over state-of-the-art MABs for K = 100.
High-Speed Compute-Efficient Bandit Learning for Many Arms
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 7, pp. 2099–2103
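As background for the MAB setting above, a minimal UCB1 baseline (a standard algorithm, not the paper’s focused-exploration scheme) illustrates the exploration-exploitation tradeoff and how cumulative regret accumulates; the arm means and horizon are illustrative:

```python
import math
import random

def ucb1(means, horizon, seed=0):
    """Run UCB1 on Bernoulli arms; return the cumulative pseudo-regret."""
    rng = random.Random(seed)
    k = len(means)
    counts, sums = [0] * k, [0.0] * k
    best = max(means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:                       # explore: play each arm once first
            arm = t - 1
        else:                            # exploit: maximize the UCB index
            arm = max(range(k), key=lambda i: sums[i] / counts[i]
                      + math.sqrt(2 * math.log(t) / counts[i]))
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]      # pseudo-regret of this pull
    return regret

# Regret stays far below that of uniform random play; it grows with the
# number of arms K, which motivates focused exploration for large K.
print(ucb1([0.9, 0.5, 0.4, 0.3], horizon=2000))
```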
Pub Date : 2025-06-03 DOI: 10.1109/TVLSI.2025.3573226
Isa Altoobaji;Ahmad Hassan;Mohamed Ali;Yves Audet;Ahmed Lakhssassi
In this article, a compact differential data transfer link architecture for isolated sensor interfaces (SIs), immune to common-mode transients (CMTs), is presented. The proposed architecture achieves low latency, supporting high-speed transmission with a low bit error rate (BER) in the presence of CMT noise, for applications such as data acquisition, biomedical equipment, and communication networks. In transportation applications, motors and actuators are subjected to harsh environmental conditions, e.g., lightning strikes and abnormal voltage operations. These conditions introduce noise and can damage small electronics through high-voltage power surges. To ensure human safety and circuit protection, a data transfer system must be implemented between the high-voltage and low-voltage domains. The proposed design has been simulated using Cadence tools, and a prototype has been manufactured in a 0.18-μm CMOS process. The fabricated prototype occupies an effective silicon area of 37.2 × 10³ μm² and can sustain a breakdown voltage of 710 Vrms. Experimental results show that the proposed solution achieves a CMT immunity (CMTI) of 2.5 kV/μs at a data rate of 480 Mb/s with a BER of 10⁻¹². The propagation delay is 3.9 ns, with a 4-ps/°C variation rate over temperatures ranging from −31 °C to 100 °C. Under typical test conditions, the BER reaches 10⁻¹⁵ with a peak-to-peak data-dependent jitter (DDJ) of 29.8 ps.
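To put the reported BER in perspective, the mean time between bit errors at the stated operating point follows from simple arithmetic, assuming independent errors. This is an illustrative back-of-the-envelope calculation, not a figure from the paper.

```python
# Mean time between bit errors at the reported operating point
# (480 Mb/s, BER = 1e-12); assumes statistically independent errors.
data_rate_bps = 480e6
ber = 1e-12

errors_per_second = data_rate_bps * ber        # 4.8e-4 errors/s
seconds_per_error = 1.0 / errors_per_second    # ~2083 s
print(seconds_per_error / 60.0)                # ≈ 34.7 minutes per bit error
```

At the typical-condition BER of 10⁻¹⁵, the same arithmetic stretches the mean time between errors by another factor of 1000, to roughly 24 days.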
{"title":"A Compact High-Speed Capacitive Data Transfer Link With Common Mode Transient Rejection for Isolated Sensor Interfaces","authors":"Isa Altoobaji;Ahmad Hassan;Mohamed Ali;Yves Audet;Ahmed Lakhssassi","doi":"10.1109/TVLSI.2025.3573226","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3573226","url":null,"abstract":"In this article, a compact differential data transfer link architecture for isolated sensor interfaces (SIs) and immune to common mode transients (CMTs) is presented. The proposed architecture shows low latency supporting high-speed transmission with a low bit error rate (BER) in the presence of CMT noise for applications, such as data acquisition, biomedical equipment, and communication networks. In transportation applications, motors and actuators are subjected to harsh environmental conditions, e.g., lightning strikes and abnormal voltage operations. These conditions introduce noise and can cause damage to small electronics due to high-voltage power surges. To ensure human safety and circuitry protection, a data transfer system must be implemented between high-voltage and low-voltage domains. The proposed design has been simulated using Cadence tools, and a prototype has been manufactured in a 0.18-<inline-formula> <tex-math>$mu $ </tex-math></inline-formula>m CMOS process. The fabricated prototype consumes an effective silicon area of <inline-formula> <tex-math>$37.2times 10^{3}~mu $ </tex-math></inline-formula>m<sup>2</sup> and can sustain a breakdown voltage of 710 V<sub>rms</sub>. Experimental results show that the proposed solution achieves a CMT immunity (CMTI) of 2.5 kV/<inline-formula> <tex-math>$mu $ </tex-math></inline-formula>s at a data rate of 480 Mb/s with a BER of <inline-formula> <tex-math>$10^{-12}$ </tex-math></inline-formula>. 
The propagation delay is 3.9 ns with a 4 ps/°C variation rate over temperatures ranging from <inline-formula> <tex-math>$- 31~^{circ }$ </tex-math></inline-formula>C to <inline-formula> <tex-math>$100~^{circ }$ </tex-math></inline-formula>C. Under typical test conditions, the BER reaches <inline-formula> <tex-math>$10^{-15}$ </tex-math></inline-formula> with a peak-to-peak data dependent jitter (DDJ) of 29.8 ps.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 8","pages":"2163-2171"},"PeriodicalIF":2.8,"publicationDate":"2025-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144705051","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}