
IEEE Transactions on Very Large Scale Integration (VLSI) Systems: Latest Articles

RHT_NoC: A Reconfigurable Hybrid Topology Architecture for Chiplet-Based Multicore System
IF 2.8; Zone 2 (Engineering & Technology); Q2 Computer Science, Hardware & Architecture; Pub Date: 2025-06-05; DOI: 10.1109/TVLSI.2025.3572112
Dongyu Xu; Wu Zhou; Zhengfeng Huang; Huaguo Liang; Xiaoqing Wen
Chiplet-based system-on-chip (SoC) architectures, leveraging 2.5-D/3-D integration technologies, provide scalable solutions for a wide range of applications. Achieving high performance and cost-effectiveness in these systems relies heavily on optimizing die-to-die interconnect topologies and designs, which are essential for seamless interchiplet communication. This article introduces a reconfigurable hybrid topology (RHT) architecture designed for chiplet-based multicore systems. RHT achieves high performance and energy efficiency by dynamically reconfiguring the network topology in response to traffic variations, adaptively selecting transport subnets, and optimizing link bandwidth allocation, thereby minimizing congestion and maximizing packet throughput. Furthermore, RHT leverages global traffic information to dynamically combine Torus loops, maximizing opportunities for rapid packet delivery while guaranteeing minimal hop counts. Moreover, RHT accelerates packet transmission via bufferless combined loops, which extends the continuous sleeping periods of routers, improves power-gating efficiency, and significantly reduces static power consumption. Simulation results indicate that the Mesh-DyRing configuration achieves over a 40% reduction in network latency and more than a 20% decrease in power consumption overhead compared to the baseline design. When compared to WiNoC, an advanced hybrid wired-wireless topology design, the Mesh-DyRing-PG configuration reduces power consumption by 56.2% while maintaining equivalent average network latency.
Vol. 33, no. 8, pp. 2104-2117.
Citations: 0
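The RHT_NoC abstract ties Torus loops to minimal hop counts. As a rough illustration of why wraparound links help, the sketch below compares the minimal hop count between two routers on a plain 2-D mesh and on a 2-D torus. The 4x4 grid, the coordinates, and the function names are illustrative assumptions, not details from the paper, and the sketch does not model RHT's dynamic reconfiguration or power gating.

```python
def mesh_hops(src, dst):
    """Minimal hop count on a 2-D mesh (Manhattan distance)."""
    (x1, y1), (x2, y2) = src, dst
    return abs(x1 - x2) + abs(y1 - y2)


def torus_hops(src, dst, width, height):
    """Minimal hop count on a 2-D torus: each dimension may wrap around."""
    (x1, y1), (x2, y2) = src, dst
    dx = abs(x1 - x2)
    dy = abs(y1 - y2)
    return min(dx, width - dx) + min(dy, height - dy)


# Hypothetical 4x4 grid: corner-to-corner traffic benefits most from wraparound.
print(mesh_hops((0, 0), (3, 3)))         # 6
print(torus_hops((0, 0), (3, 3), 4, 4))  # 2
```

Neighbouring routers see no benefit from wraparound, which is one reason a design might enable such loops only when the traffic pattern calls for them.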
A Scalable FPGA Architecture With Adaptive Memory Utilization for GEMM-Based Operations
IF 2.8; Zone 2 (Engineering & Technology); Q2 Computer Science, Hardware & Architecture; Pub Date: 2025-06-03; DOI: 10.1109/TVLSI.2025.3571677
Anastasios Petropoulos; Theodore Antonakopoulos
Deep neural network (DNN) inference relies increasingly on specialized hardware for high computational efficiency. This work introduces a field-programmable gate array (FPGA)-based dynamically configurable accelerator featuring systolic arrays (SAs), high-bandwidth memory (HBM), and UltraRAMs. We present two processing unit (PU) configurations with different computing capabilities using the same interfaces and peripheral blocks. By instantiating multiple PUs and employing a heuristic weight transfer schedule, the architecture achieves notable throughput efficiency over prior works. Moreover, we outline how the architecture can be extended to emulate analog in-memory computing (AIMC) devices to aid next-generation heterogeneous AIMC chip designs and investigate device-level noise behavior. Overall, this brief presents a versatile DNN inference acceleration architecture adaptable to various models and future FPGA designs.
Vol. 33, no. 8, pp. 2334-2338.
Citations: 0
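GEMM workloads are commonly mapped onto fixed-size compute arrays by splitting the matrices into tiles. The sketch below shows tiled matrix multiplication in plain Python as a software analogy for that idea. The tile size of 2 and the function name are assumptions made for illustration. It does not model the paper's systolic arrays, HBM or UltraRAM usage, or its weight transfer schedule.

```python
def tiled_gemm(a, b, tile=2):
    """Compute C = A x B by accumulating tile-by-tile partial products."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    for i0 in range(0, m, tile):          # tile rows of C
        for j0 in range(0, n, tile):      # tile columns of C
            for k0 in range(0, k, tile):  # tiles along the reduction dimension
                for i in range(i0, min(i0 + tile, m)):
                    for j in range(j0, min(j0 + tile, n)):
                        acc = c[i][j]
                        for kk in range(k0, min(k0 + tile, k)):
                            acc += a[i][kk] * b[kk][j]
                        c[i][j] = acc
    return c


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
print(tiled_gemm(a, b))  # [[58, 64], [139, 154]]
```

Each `(i0, j0, k0)` step touches only one tile of each operand, which is the property that lets a hardware design keep the working set in on-chip memory.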
Design of a Low-Power Analog Integrated Deep Convolutional Neural Network
IF 2.8; Zone 2 (Engineering & Technology); Q2 Computer Science, Hardware & Architecture; Pub Date: 2025-06-03; DOI: 10.1109/TVLSI.2025.3573045
Zisis Foufas; Vassilis Alimisis; Paul P. Sotiriadis
In this article, a framework for the analog implementation of a deep convolutional neural network (CNN) is introduced and used to derive a new circuit architecture composed of an improved analog multiplier and circuit blocks implementing the ReLU activation function and the argmax operator. The operating principles of the individual blocks, as well as those of the complete architecture, are analyzed and used to realize a low-power analog classifier consuming less than $1.8~\mu\text{W}$. The proper operation of the classifier is verified via a comparison with a software-equivalent implementation, and its performance is evaluated against existing circuit architectures. The proposed architecture is implemented in a TSMC 90-nm CMOS process and simulated using Cadence IC Suite for both schematic and layout design. Corner and Monte Carlo mismatch simulations of the schematic and the physical circuit (postlayout) were conducted to evaluate the effect of transistor mismatches and process, voltage, and temperature (PVT) variations and to showcase a proposed systematic method for offsetting their effect.
Vol. 33, no. 8, pp. 2172-2185.
Citations: 0
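The analog CNN abstract names three building blocks: a multiplier, a ReLU block, and an argmax block. A minimal software sketch of how those three operations chain into a classifier is given below. The weights, the input vector, and the single-layer structure are made-up assumptions for illustration. This is neither the authors' network nor their software-equivalent implementation.

```python
def relu(x):
    """Rectified linear unit: pass positive values, clamp the rest to zero."""
    return x if x > 0 else 0.0


def classify(features, weights):
    """Return the index of the class with the largest ReLU(dot product)."""
    scores = []
    for row in weights:
        acc = sum(w * f for w, f in zip(row, features))  # multiply-accumulate
        scores.append(relu(acc))
    return max(range(len(scores)), key=scores.__getitem__)  # argmax


features = [1.0, -2.0, 0.5]
weights = [
    [0.2, 0.1, 0.0],    # class 0 score: 0.2 - 0.2 + 0.0 = 0.0
    [-0.5, -1.0, 1.0],  # class 1 score: -0.5 + 2.0 + 0.5 = 2.0
    [1.0, 0.5, 0.0],    # class 2 score: 1.0 - 1.0 + 0.0 = 0.0
]
print(classify(features, weights))  # 1
```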
High-Speed Compute-Efficient Bandit Learning for Many Arms
IF 2.8; Zone 2 (Engineering & Technology); Q2 Computer Science, Hardware & Architecture; Pub Date: 2025-06-03; DOI: 10.1109/TVLSI.2025.3573924
Ishaan Sharma; Sumit J. Darak; Rohit Kumar
Multiarmed bandits (MABs) are online machine learning algorithms that aim to identify the optimal arm without prior statistical knowledge via the exploration-exploitation tradeoff. The performance metric, regret, and computational complexity of the MAB algorithms degrade with the increase in the number of arms, $K$. In applications such as wireless communication, radar systems, and sensor networks, $K$, i.e., the number of antennas, beams, bands, etc., is expected to be large. In this work, we consider focused exploration-based MAB, which outperforms conventional MAB for large $K$, and its mapping on various edge processors and multiprocessor system on a chip (MPSoC) via hardware-software co-design (HSCD) and fixed point (FP) analysis. The proposed architecture offers 67% reduction in average cumulative regret, 84% reduction in execution time on edge processor, 97% reduction in execution time using FPGA-based accelerator, and 10% savings in resources over state-of-the-art MABs for large $K=100$.
Vol. 33, no. 7, pp. 2099-2103.
Citations: 0
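To make the exploration-exploitation tradeoff and the regret metric in the bandit abstract concrete, here is a minimal sketch of the classic UCB1 algorithm on Bernoulli arms. UCB1 is used as a stand-in for a conventional MAB. It is not the focused exploration-based algorithm the paper maps to hardware, and the arm means, horizon, and seed are arbitrary assumptions.

```python
import math
import random


def ucb1(means, horizon, seed=0):
    """Run UCB1 on Bernoulli arms; return (pull counts, cumulative regret)."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k
    totals = [0.0] * k
    regret = 0.0
    best = max(means)
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # pull every arm once to initialise its estimate
        else:
            # Pick the arm with the largest mean estimate plus confidence bonus.
            arm = max(
                range(k),
                key=lambda i: totals[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        totals[arm] += reward
        regret += best - means[arm]  # expected (pseudo-)regret of this pull
    return counts, regret


counts, regret = ucb1([0.1, 0.3, 0.5, 0.9], horizon=2000)
print(counts, round(regret, 1))
```

The index computation inside the loop touches all $K$ arms on every round, and the initial phase pulls each arm once, which hints at why both regret and per-round compute grow with the number of arms.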
A Compact High-Speed Capacitive Data Transfer Link With Common Mode Transient Rejection for Isolated Sensor Interfaces
IF 2.8; Zone 2 (Engineering & Technology); Q2 Computer Science, Hardware & Architecture; Pub Date: 2025-06-03; DOI: 10.1109/TVLSI.2025.3573226
Isa Altoobaji; Ahmad Hassan; Mohamed Ali; Yves Audet; Ahmed Lakhssassi
In this article, a compact differential data transfer link architecture for isolated sensor interfaces (SIs) that is immune to common-mode transients (CMTs) is presented. The proposed architecture offers low latency and supports high-speed transmission with a low bit error rate (BER) in the presence of CMT noise for applications such as data acquisition, biomedical equipment, and communication networks. In transportation applications, motors and actuators are subjected to harsh environmental conditions, e.g., lightning strikes and abnormal voltage operations. These conditions introduce noise and can damage small electronics through high-voltage power surges. To ensure human safety and circuitry protection, a data transfer system must be implemented between high-voltage and low-voltage domains. The proposed design has been simulated using Cadence tools, and a prototype has been manufactured in a 0.18-$\mu$m CMOS process. The fabricated prototype occupies an effective silicon area of $37.2\times 10^{3}~\mu\text{m}^{2}$ and can sustain a breakdown voltage of 710 V$_{\text{rms}}$. Experimental results show that the proposed solution achieves a CMT immunity (CMTI) of 2.5 kV/$\mu$s at a data rate of 480 Mb/s with a BER of $10^{-12}$. The propagation delay is 3.9 ns with a 4 ps/°C variation rate over temperatures ranging from $-31~^{\circ}$C to $100~^{\circ}$C. Under typical test conditions, the BER reaches $10^{-15}$ with a peak-to-peak data-dependent jitter (DDJ) of 29.8 ps.
Vol. 33, no. 8, pp. 2163-2171.
Citations: 0
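The figures quoted for the capacitive link can be turned into more tangible quantities with a little arithmetic. The sketch below uses only numbers from the abstract. It assumes the 4 ps/°C slope is constant across the whole temperature range and that bit errors are independent, neither of which the abstract states.

```python
data_rate_bps = 480e6        # 480 Mb/s, from the abstract
ber = 1e-12                  # bit error rate at that data rate
delay_slope_ps_per_c = 4.0   # propagation-delay variation rate
t_min_c, t_max_c = -31.0, 100.0

# Expected errors per second and mean time between bit errors.
errors_per_second = data_rate_bps * ber
seconds_per_error = 1.0 / errors_per_second

# Delay drift across the full temperature range, assuming a constant slope.
delay_drift_ps = delay_slope_ps_per_c * (t_max_c - t_min_c)

print(f"{seconds_per_error:.0f} s between errors on average")  # 2083 s
print(f"{delay_drift_ps:.0f} ps delay drift over temperature")  # 524 ps
```

Under these assumptions, a BER of $10^{-12}$ at 480 Mb/s corresponds to roughly one bit error every 35 minutes, and the 524 ps drift is about 13% of the 3.9 ns propagation delay.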
Toward High-Performance Network Coding: FPGA Acceleration With Bounded-Value Generators
IF 2.8; Zone 2 (Engineering & Technology); Q2 Computer Science, Hardware & Architecture; Pub Date: 2025-06-02; DOI: 10.1109/TVLSI.2025.3572517
Jiaxin Qing; Philip H. W. Leong; Kin-Hong Lee; Raymond W. Yeung
Network coding enhances performance in network communications and distributed storage by increasing throughput and robustness while reducing latency. Batched sparse (BATS) codes are a class of capacity-achieving network codes, but their practical applications are hindered by their structure and by the computational intensity and power demands of finite field (FF) operations. Most literature focuses on algorithmic-level techniques to improve the coding efficiency, while optimization through algorithm/hardware co-design has long been neglected. Leveraging the unique structure of BATS codes, we first present cyclic-shift BATS (CS-BATS), a hardware-friendly variant. Next, we propose a simple but effective bounded-value (BV) generator to reduce the size of a finite field multiplier by up to 70%. Finally, we report on a scalable and resource-efficient field-programmable gate array (FPGA)-based network coding accelerator that achieves a throughput of 27 Gb/s, a speedup of more than 300× over software.
Vol. 33, no. 8, pp. 2274-2287.
Citations: 0
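The network coding abstract identifies finite field multiplication as the costly operation. For readers unfamiliar with it, the sketch below multiplies two elements of GF(2^8) using shift-and-XOR steps. The field size and the reduction polynomial 0x11B (the one used by AES) are assumptions chosen because they have a well-known test vector. The abstract does not say which field the accelerator uses, and the sketch does not model the bounded-value generator or CS-BATS.

```python
def gf256_mul(a, b, poly=0x11B):
    """Multiply two GF(2^8) elements with shift-and-add (carry-less) steps."""
    result = 0
    while b:
        if b & 1:
            result ^= a   # addition in GF(2^m) is XOR
        a <<= 1
        if a & 0x100:
            a ^= poly     # reduce modulo the field polynomial
        b >>= 1
    return result


print(hex(gf256_mul(0x57, 0x83)))  # 0xc1
```

The loop runs once per bit of `b`, so in hardware the multiplier's size scales with the operand width. Restricting the range of one operand is therefore one plausible way to shrink it.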
A Compact Power-on-Reset Circuit With Configurable Brown-Out Detection
IF 2.8; Zone 2 (Engineering & Technology); Q2 Computer Science, Hardware & Architecture; Pub Date: 2025-04-30; DOI: 10.1109/TVLSI.2025.3561131
Yoochang Kim; Jun-Eun Park; Kwanseo Park; Young-Ha Hwang
A compact power-on-reset (POR) circuit with a configurable brown-out reset (BOR) function is presented. An integrated voltage reference (VR) circuit provides a constant bias voltage that facilitates voltage-triggered POR/BOR operation, reliably preventing POR signal generation when the ramping supply voltage ($V_{\text{DD}}$) level is too low. Moreover, the proposed POR circuit features a fast, configurable POR/BOR operation owing to an inverter-based trip point detector (TPD), which triggers the reset signal with a programmable trip point. The prototype POR circuit achieves a POR level higher than 752 mV with a maximum POR delay of $16.4~\mu$s at a 0.8–1.2-V $V_{\text{DD}}$, supporting a wide range of supply ramping time from $1~\mu$s to 1 s. In addition, the prototype detects brown-out events with a supply drop of 0.1–0.4 V, generating the BOR signal. Designed using a 28-nm CMOS process, the prototype has a compact active area of $995.3~\mu\text{m}^{2}$ and a quiescent current of 162–974 nA at a 1-V $V_{\text{DD}}$.
Vol. 33, no. 7, pp. 2074-2078.
Citations: 0
Efficient Partial Recomputation-Based Fault Detection Approaches for Z-transform
IF 2.8; Zone 2 (Engineering & Technology); Q2 Computer Science, Hardware & Architecture; Pub Date: 2025-04-30; DOI: 10.1109/TVLSI.2025.3560154
Saeed Aghapour; Kasra Ahmadi; Mehran Mozaffari Kermani; Reza Azarderakhsh
The Z-transform is a fundamental and powerful tool widely used in signal processing and in other applications such as communications and networking. By analyzing the Z-transform of a signal, one can extract critical information about its stability, causality, frequency response, energy and power, and overall behavior. However, errors caused either by environmental changes or by malicious injections in very-large-scale integration (VLSI) implementations can critically compromise the integrity and reliability of the output. Failure to detect such faults may result in unpredictable, erroneous, and misleading function analyses. Therefore, the ability to detect soft errors and faults before accepting the results is of paramount importance. In this article, we propose an efficient fault detection method that combines algorithmic-level checks with partial recomputation to identify both transient and permanent faults with a high error coverage rate across various injection scenarios. The AMD/Xilinx field-programmable gate array (FPGA) implementation of our design demonstrated only a modest increase in time and area overhead. To the best of our knowledge, fault detection for the Z-transform function has not been previously studied.
Vol. 33, no. 7, pp. 1983-1993.
Citations: 0
A Universal Sequential Authentication Scheme for TAPC-Based Test Standards
IF 2.8 Zone 2 Engineering & Technology Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2025-04-29 DOI: 10.1109/TVLSI.2025.3562015
Guan-Rong Chen;Kuen-Jong Lee
Integrated circuits (ICs) have become extremely complex nowadays. Therefore, multiple test standards could be employed to handle different testing scenarios. Unfortunately, this also leads to serious security problems since attackers can exploit the excellent controllability and observability of test standards to steal confidential information or disrupt the circuit’s functionality. This article proposes a universal sequential authentication scheme that is compatible with test standards employing the test access port controller (TAPC) defined in IEEE Std 1149.1. The main objective is to protect multiple TAPC-based test standards with a universal security module. In this scheme, only authorized test data can be updated to the target register to control the corresponding test standard, and only the response to authorized test data can be output. The key idea is to generate different authentication keys for different test data, and even with the same set of test data, if their input sequences are different, their authentication keys will also be different. Furthermore, we develop an irreversible obfuscation mechanism to generate fake output data to confuse attackers. Due to its irreversibility, the original correct output data cannot be deduced from the fake output data. Experimental results on a typical processor, i.e., SCR1, show that the proposed scheme causes no time overhead, and the area overhead is only 1.74%.
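The abstract does not say how the authentication keys are derived, so the following is only a minimal sketch of the stated property: each test datum gets its own key, and the same set of data applied in a different order yields different keys. It swaps in a plain SHA-256 hash chain as a stand-in; the function name `chained_keys` and the seed value are hypothetical and are not taken from the paper.

```python
import hashlib


def chained_keys(seed: bytes, test_data: list[bytes]) -> list[bytes]:
    """Return one key per test datum; each key depends on all data seen so far.

    Illustrative hash chain only, not the scheme from the paper.
    """
    keys = []
    state = seed
    for datum in test_data:
        # Length-prefix the datum so that concatenation is unambiguous,
        # then fold it into the running state. Order therefore matters.
        prefix = len(datum).to_bytes(4, "big")
        state = hashlib.sha256(state + prefix + datum).digest()
        keys.append(state)
    return keys


seed = b"hypothetical-device-secret"  # assumed shared secret, for illustration
forward = chained_keys(seed, [b"\x01", b"\x02", b"\x03"])
reverse = chained_keys(seed, [b"\x03", b"\x02", b"\x01"])

# Same set of test data, different input sequence: the keys differ.
print(forward[-1] != reverse[-1])
```

Because every key folds in the whole history, replaying an authorized datum out of order would not reproduce its key in this model. A hardware design would likely use a far cheaper primitive than SHA-256.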
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 7, pp. 1972-1982. https://doi.org/10.1109/TVLSI.2025.3562015
Citations: 0
A Novel High-Throughput FFT Processor With a Block-Level Pipeline for 5G MIMO OFDM Systems
IF 2.8 Zone 2 Engineering & Technology Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2025-04-28 DOI: 10.1109/TVLSI.2025.3558947
Meiyu Liu;Zhijun Wang;Hanqing Luo;Shengnan Lin;Liping Liang
In fifth-generation (5G) communication systems, multiple input multiple output (MIMO) and orthogonal frequency-division multiplexing (OFDM) are two critical technologies. The fast Fourier transform (FFT), as the core processing step of OFDM, directly affects the overall system performance. In this brief, we propose a novel block-level pipelined architecture that divides the FFT processor into three pipeline blocks: input, radix, and output. Each pipeline block can work on a different FFT at the same time, which raises throughput. In addition, to reduce the OFDM system-level latency of 5G applications, the FFT processor supports weighted overlap and add (WOLA) on the cyclic prefix and suffix of OFDM symbols. The architecture is implemented in TSMC 12-nm technology, with a processor die area of 0.89 mm² and a power consumption of 568 mW at 1 GHz. The FFT processor achieves a system-level throughput of up to 2.66 GS/s.
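Why letting the three blocks work on different FFTs raises throughput can be seen in a small scheduling model. This is a sketch under assumptions, not the processor's actual timing: the per-block cycle counts are made-up numbers, the function name `finish_times` is hypothetical, and the model assumes each block can hand its result to the next block without stalling on buffer space.

```python
def finish_times(num_ffts: int,
                 stage_cycles: tuple[int, int, int],
                 pipelined: bool) -> list[int]:
    """Cycle at which each FFT leaves the output block.

    stage_cycles gives the (assumed) cycles spent in the input, radix,
    and output blocks for one FFT.
    """
    free_at = [0, 0, 0]  # cycle at which each block becomes free again
    done = []
    prev_done = 0
    for _ in range(num_ffts):
        # Without block-level pipelining, an FFT may not enter the input
        # block until the previous FFT has left the output block.
        t = 0 if pipelined else prev_done
        for stage, cycles in enumerate(stage_cycles):
            start = max(t, free_at[stage])  # wait until the block is free
            t = start + cycles
            free_at[stage] = t
        done.append(t)
        prev_done = t
    return done


# Hypothetical example: three blocks of 100 cycles each, four FFTs.
print(finish_times(4, (100, 100, 100), pipelined=True))
print(finish_times(4, (100, 100, 100), pipelined=False))
```

In this model the pipelined case completes one FFT per 100 cycles once the pipeline is full (finishing at cycles 300, 400, 500, 600), while the non-pipelined case needs 300 cycles per FFT (300, 600, 900, 1200). With unequal blocks, the steady-state interval is set by the slowest block.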
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 7, pp. 2059-2063. https://doi.org/10.1109/TVLSI.2025.3558947
Citations: 0