
IEEE Transactions on Very Large Scale Integration (VLSI) Systems: Latest Articles

An Implementation of Reconfigurable Match Table for FPGA-Based Programmable Switches
IF 2.8 Zone 2 (Engineering & Technology) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2024-08-07 DOI: 10.1109/TVLSI.2024.3436047
Xiaoyong Song;Zhichuan Guo
The match table is the key component for packet processing and forwarding in programmable switches in a software-defined network (SDN). However, the match tables in current field-programmable gate array (FPGA)-based switches are either inflexible or undisclosed. When the network function changes, the match table on the FPGA must be redesigned or have its size parameters reset, and it works again only after recompilation and reimplementation; this time-consuming and labor-intensive operation seriously reduces the flexibility and configurability of the switch. To address this issue, this article presents a design of a reconfigurable match table (RMT) for FPGA-based programmable switches. A three-layer table structure is introduced to realize the reconfiguration and hardware-plane mapping of user-defined tables, and the logical tables in the packet processing pipeline are interconnected with the physical tables in the memory pool by a resource-efficient segment crossbar. To the best of our knowledge, this article is the first to publicly present the entire FPGA-based RMT design scheme and implementation details. The proposed design implements reconfigurable ternary content addressable memory (TCAM)-based and static random access memory (SRAM)-based match tables on a Xilinx FPGA and verifies them with a packet filter system. In the proposed RMT system, a user can reconfigure the number, depth, and width of user-defined match tables (UMTs) in the pipeline via the control plane without modifying hardware, which greatly enhances the flexibility of the data plane of FPGA-based switches.
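The two match-table types named in the abstract can be illustrated with a small software model. This is a minimal sketch of generic ternary (TCAM-style) and exact (SRAM-style) matching, not the authors' hardware design; the names `TernaryEntry`, `tcam_lookup`, `sram_lookup`, and the 8-bit rules are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TernaryEntry:
    value: int   # bit pattern to compare against
    mask: int    # 1 bits are compared, 0 bits are "don't care"
    action: str  # hypothetical action label


def tcam_lookup(table: list[TernaryEntry], key: int) -> Optional[str]:
    """Return the action of the first (highest-priority) matching entry."""
    for entry in table:
        if (key & entry.mask) == (entry.value & entry.mask):
            return entry.action
    return None


def sram_lookup(table: dict[int, str], key: int) -> Optional[str]:
    """Exact match: the key must equal a stored key bit for bit."""
    return table.get(key)


# Hypothetical 8-bit packet-filter rules, highest priority first.
RULES = [
    TernaryEntry(value=0b1010_0000, mask=0b1111_0000, action="drop"),
    TernaryEntry(value=0b1000_0000, mask=0b1000_0000, action="forward"),
    TernaryEntry(value=0, mask=0, action="default"),  # matches every key
]

if __name__ == "__main__":
    for key in (0b1010_1111, 0b1100_0000, 0b0000_0001):
        print(f"{key:08b} -> {tcam_lookup(RULES, key)}")
```

In this model, changing the number, depth, or width of a table is just a change to the Python data; the abstract's point is that the proposed RMT gives a comparable freedom on the FPGA through the control plane, without recompiling the hardware.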
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 32, no. 11, pp. 2121-2134. Journal Article.
Citations: 0
A High-Precision and High-Dynamic-Range Current-Mode WTA Circuit for Low-Supply-Voltage Applications
IF 2.8 Zone 2 (Engineering & Technology) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2024-08-07 DOI: 10.1109/TVLSI.2024.3436575
Mehdi Saberi;Hossein Yaghoobzadeh Shadmehri;Mohammad Tavakkoli Ghouchani;Alexandre Schmid
This brief proposes a low-voltage, high-precision, and high-dynamic-range current-mode analog winner-take-all (WTA) circuit. The proposed structure employs a new high-gain stage as a feedback network between the input node of each cell and the common node of the circuit to reduce the sensitivity of the output current to the loser signals, especially when they are close to the winner. In addition, another network is employed that senses the amount of the output/winner current and adjusts the bias current of the gain stages. This ensures that the drain-source voltage of the input transistor in the winner cell matches the behavior of the output transistor’s drain-source voltage, enhancing the accuracy as well as the input dynamic range (DR) of the structure. Moreover, since the circuit works properly with a minimum supply voltage of only $V_{\text{GS}} + V_{\text{eff}}$, it is a promising candidate for applications in emerging technologies with low supply voltage requirements. Based on the proposed structure, a three-input WTA circuit is designed and fabricated in a 0.18-$\mu$m CMOS technology. According to the measurement results, the proposed circuit exhibits a maximum error of 1.5% for the input signal range of $60~\mu$A when the input frequency is 100 kHz. The silicon area occupied by the circuit is $33~\mu\text{m} \times 65~\mu\text{m}$.
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 32, no. 10, pp. 1955-1958. Journal Article.
Citations: 0
Thresholding Decision-Directed Descent (T3D): A Tuning Solution for DDR5 DRAM DFEs
IF 2.8 Zone 2 (Engineering & Technology) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2024-08-06 DOI: 10.1109/TVLSI.2024.3435419
Mitchell Cooke;Nicola Nicolici
Emerging memory technologies, such as DDR5, offer increased data rates and storage capacities, at the expense of signal integrity challenges. To address these challenges, the DDR5 standard incorporates a four-tap decision feedback equalizer (DFE). As elaborated in this article, known methods for DFE tuning are limited due to interface complexity and distinct equalization requirements for DDR5. We propose a decision-directed DFE tuning method called thresholding decision-directed descent (T3D). By leveraging DDR5 architectural features, our novel method tracks the eye envelope as it opens, which facilitates rapid convergence compared to the state of the art. To validate the performance of T3D, silicon measurements are presented alongside a virtual testbench methodology. By demonstrating the high correlation between silicon and simulation results, the virtual testbench can be beneficial for the design, validation, and prototyping of future DFE tuning methods.
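For readers unfamiliar with decision-directed DFE tuning, the general idea can be sketched in a few lines. The code below is a textbook four-tap DFE adapted with sign-sign LMS on binary symbols; it is not the T3D method, whose details the abstract does not give. The channel coefficients, step size `mu`, and function names are assumptions made for the example.

```python
import random


def channel(symbols, post_cursors):
    """Add post-cursor inter-symbol interference (ISI) to a symbol stream."""
    rx = []
    for k, s in enumerate(symbols):
        isi = sum(h * symbols[k - 1 - i]
                  for i, h in enumerate(post_cursors) if k - 1 - i >= 0)
        rx.append(s + isi)
    return rx


def sign(x):
    return (x > 0) - (x < 0)


def run_dfe(rx, num_taps=4, mu=0.002):
    """Equalize rx with a DFE whose taps adapt from its own decisions."""
    taps = [0.0] * num_taps
    history = [0.0] * num_taps  # past decisions, most recent first
    decisions = []
    for sample in rx:
        # Subtract the ISI predicted from previous decisions.
        equalized = sample - sum(t * d for t, d in zip(taps, history))
        decision = 1.0 if equalized >= 0 else -1.0
        error = equalized - decision
        # Sign-sign LMS: move each tap toward cancelling the residual ISI.
        for i in range(num_taps):
            taps[i] += mu * sign(error) * history[i]
        history = [decision] + history[:-1]
        decisions.append(decision)
    return decisions, taps


if __name__ == "__main__":
    random.seed(0)
    tx = [random.choice((-1.0, 1.0)) for _ in range(5000)]
    h = [0.3, -0.15, 0.1, 0.05]  # illustrative post-cursor taps
    decisions, taps = run_dfe(channel(tx, h))
    print("bit errors:", sum(d != s for d, s in zip(decisions, tx)))
    print("adapted taps:", [round(t, 3) for t in taps])
```

The example channel keeps the eye open from the start, so every decision is correct and the taps settle near the channel's post-cursor values. A real DDR5 link starts from a much harder condition, which is the situation the abstract's eye-envelope tracking is aimed at.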
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 32, no. 11, pp. 2060-2073. Journal Article.
Citations: 0
High-Performance Error and Erasure Decoding With Low Complexities Using SPC-RS Concatenated Codes
IF 2.8 Zone 2 (Engineering & Technology) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2024-08-06 DOI: 10.1109/tvlsi.2024.3435773
Zhihao Zhou, Wei Zhang, Xinyi Guo, Jianhan Zhao, Yanyan Liu
https://doi.org/10.1109/tvlsi.2024.3435773 (Journal Article; no abstract provided; volume and issue listed as "41 1", no page range listed)
Citations: 0
Low-Jitter Frequency Doubling Circuit Supporting Higher-Speed BISG and Aging Sensing in a Chiplet-Based Design Environment
IF 2.8 Zone 2 (Engineering & Technology) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2024-08-05 DOI: 10.1109/tvlsi.2024.3435059
Ko-Hong Lin, Ont-Derh Lin, Shi-Yu Huang, Duo Sheng
https://doi.org/10.1109/tvlsi.2024.3435059 (Journal Article; no abstract provided; volume and issue listed as "57 1", no page range listed)
Citations: 0
A High-Speed Dynamic Element Matching Decoder With Integrated Background Calibration Control
IF 2.8 Zone 2 (Engineering & Technology) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2024-07-30 DOI: 10.1109/TVLSI.2024.3432640
Tobias Schirmer;Simon Buhr;Felix Burkhardt;Florian Protze;Frank Ellinger
A dynamic element matching (DEM) decoder with integrated mismatch calibration control for high-speed current-steering digital-to-analog converters (CS-DACs) and CS-DAC-based direct digital frequency synthesizers (DDFSs) is studied and presented. The DEM algorithm achieves very good averaging of mismatch-induced errors in the succeeding CS-DAC. It features a minimum element transition rate, therefore optimizing the power dissipation and ensuring minimal glitch energy at the output. Due to the chosen network-based architecture, with only a few modifications of the hardware, the decoder allows the integration of a comprehensive current source mismatch calibration that can be fully operated in the background and even in parallel to the regular DEM operation. A proof-of-concept hardware implementation of the presented decoder was fabricated in a 22-nm FD-SOI CMOS process and characterized in a high-speed DDFS system with a sampling rate of 5 GHz. Measurements reveal a significant improvement in the spurious-free dynamic range (SFDR) and signal-to-noise-and-distortion ratio (SNDR) when the calibration and DEM are enabled. Compared to the state-of-the-art (SoA), the presented DDFS achieves one of the best figures of merit.
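The averaging effect of dynamic element matching can be shown with a generic scheme. The sketch below uses data-weighted averaging (DWA), a common textbook DEM algorithm that rotates through the unit elements; the abstract does not state which algorithm the authors use, and DWA does not by itself provide the minimum transition rate they describe. The function names and the eight-element example are hypothetical.

```python
from collections import Counter


def dwa_select(codes, num_elements):
    """Yield, for each input code, the indices of the unit elements to switch on.

    A rotating pointer makes each new selection start where the previous one
    ended, so all elements are used equally often over time.
    """
    pointer = 0
    for code in codes:
        if not 0 <= code <= num_elements:
            raise ValueError("code must be between 0 and num_elements")
        yield [(pointer + i) % num_elements for i in range(code)]
        pointer = (pointer + code) % num_elements


def usage_counts(codes, num_elements):
    """Count how many times each unit element was selected."""
    counts = Counter()
    for selected in dwa_select(codes, num_elements):
        counts.update(selected)
    return [counts[i] for i in range(num_elements)]


if __name__ == "__main__":
    codes = [3, 5, 2, 6]  # thermometer-coded DAC input values
    for code, selected in zip(codes, dwa_select(codes, 8)):
        print(code, selected)
    print("usage per element:", usage_counts(codes, 8))
```

Because every element is exercised equally often, the error contributed by any one mismatched current source is spread across many output codes instead of appearing as a fixed, code-dependent distortion.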
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 32, no. 11, pp. 2074-2084. Journal Article.
Citations: 0
An Efficient Two-Stage Pipelined Compute-in-Memory Macro for Accelerating Transformer Feed-Forward Networks
IF 2.8 Zone 2 (Engineering & Technology) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2024-07-29 DOI: 10.1109/TVLSI.2024.3432403
Heng Zhang;Wenhe Yin;Sunan He;Yuan Du;Li Du
Transformer architectures have achieved state-of-the-art performance in various applications. However, deploying transformer models on resource-constrained platforms is still challenging due to their dynamic workloads, intensive computations, and substantial memory access. In this article, we propose a two-stage pipelined compute-in-memory (CIM) macro for effectively deploying and accelerating the feed-forward network (FFN) layers of transformer models. Two independent CIM arrays are designed to execute the two distinct linear projections in FFN layers, which are interconnected by co-designed analog rectified linear unit (ReLU) circuits to realize the nonlinear activation function. The analog multiply-and-add (MAC) results from the first CIM array are streamed directly to the analog ReLU circuits, and subsequently to the next CIM array for performing another linear projection. This architecture eliminates the need for analog-to-digital converters (ADCs) and digital-to-analog converters (DACs) for internal results’ staging, thereby enhancing overall macro efficiency and reducing computing latency. A proof-of-concept macro is fabricated using a TSMC 65-nm process and achieves 4.096 TOPS peak throughput, 4.39 TOPS/mm$^2$ area efficiency, and 49.83 TOPS/W energy efficiency. To map transformer models onto the proposed macro, we quantize the FFN layers of the BERT-Mini model under per-token granularity for activations and per-tensor granularity for weights using quantization-aware training (QAT), which exhibits excellent accuracy across multiple benchmarks.
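The computation the macro accelerates, and the two quantization granularities mentioned in the abstract, can be written out as a short functional model. This is a plain software sketch (biases omitted), not a model of the analog circuit; the 8-bit symmetric quantizer and all function names are assumptions, since the abstract does not state the bit width or the quantizer form.

```python
def matvec(weights, x):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def relu(values):
    return [max(0.0, v) for v in values]


def ffn(x, w1, w2):
    """Two linear projections with a ReLU in between: W2 * ReLU(W1 * x)."""
    return matvec(w2, relu(matvec(w1, x)))


def per_token_scale(token, bits=8):
    """One scale per activation vector (token), from its largest magnitude."""
    qmax = 2 ** (bits - 1) - 1
    return (max(abs(v) for v in token) / qmax) or 1.0


def per_tensor_scale(matrix, bits=8):
    """One scale shared by the whole weight matrix."""
    qmax = 2 ** (bits - 1) - 1
    return (max(abs(v) for row in matrix for v in row) / qmax) or 1.0


def quantize(values, scale, bits=8):
    """Symmetric quantization to signed integers, clamped to the valid range."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


if __name__ == "__main__":
    w1 = [[1, 0], [0, 1], [1, 1]]    # first projection: 2 -> 3
    w2 = [[1, 1, 1], [1, -1, 0]]     # second projection: 3 -> 2
    token = [1.0, -2.0]
    print("FFN output:", ffn(token, w1, w2))
    scale = per_token_scale(token)
    print("quantized token:", quantize(token, scale))
```

In the proposed macro the value passed between the two `matvec` calls stays in the analog domain, which is why no ADC or DAC is needed at that point.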
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 32, no. 10, pp. 1889-1899. Journal Article.
Citations: 0
A Low-Cost Quadruple-Node-Upsets Resilient Latch Design
IF 2.8 Zone 2 (Engineering & Technology) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2024-07-29 DOI: 10.1109/TVLSI.2024.3430224
Luchang He;Chenchen Xie;Qingyu Wu;Siqiu Xu;Houpeng Chen;Xing Ding;Xi Li;Zhitang Song
In this article, a low-cost quadruple-node-upsets resilient latch (LCQRL) design is proposed. To meet the high-reliability demands of safety-critical applications, the latch integrates nine soft-error-interceptive modules (SIMs) to form robust feedback loops, ensuring complete resilience to quadruple-node upsets (QNUs). Each SIM comprises ten CMOS transistors and a clocked inverter. Notably, C-element (CE) and dual interlocked storage cell (DICE) modules are not employed in this circuit, resulting in a small area and low power consumption. The simulation results verify the complete QNU self-recoverability and cost-effectiveness of this design. Compared with the existing radiation-hardened QNU resilient latches, the LCQRL latch demonstrates significant improvements in area, power consumption, and area-power-delay product (APDP) by 47.8%, 63%, and 75.5%, respectively. Furthermore, it exhibits low sensitivity to process, voltage, and temperature (PVT) variations.
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 32, no. 10, pp. 1930-1939. Journal Article.
Citations: 0
An Electrical-Thermal Co-Simulation Model of Chiplet Heterogeneous Integration Systems
IF 2.8 Zone 2 (Engineering & Technology) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2024-07-29 DOI: 10.1109/TVLSI.2024.3430498
Xiaoning Ma;Qinzhi Xu;Chenghan Wang;He Cao;Jianyun Liu;Daoqing Zhang;Zhiqiang Li
Chiplet heterogeneous integration (CHI) is one of the important technology choices to continue Moore’s law. However, due to the characteristics of high power and low supply voltage in CHI systems, heavy currents need to flow through the power delivery network (PDN), and the Joule heating effect will result in the overall temperature increase of the CHI system. Meanwhile, the high temperature will cause the current as well as the performance of the system to degrade and a series of reliability problems will occur. In this article, an effective electrical-thermal coupling model is proposed to predict the steady-state temperature distribution of a 2.5-D CHI system considering the Joule heating effect and the temperature effect on the IR drop. The equivalent electrical conductivity model is also built up to describe the design features of the redistribution layer (RDL), bump, and through silicon via (TSV) structures based on the electrical-thermal duality. Furthermore, the governing equations for voltage distribution and temperature distribution are solved simultaneously by utilizing the finite volume method (FVM) with nonuniform mesh to realize the electrical-thermal co-simulation of the multiscale CHI system. The model application is further performed to investigate the influence of the model parameters on the voltage drop and temperature distribution of the CHI system. The verified systems and simulated results of the present investigation demonstrate the viability and accuracy of voltage and temperature field co-simulation and indicate that the new proposed electrical-thermal model is helpful in thermal and voltage drop analysis of packaging structures with the Joule heating effect and can be adopted to assist in the physical design optimization of 2.5-D CHI or 3-D heterogeneous stacked chips.
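The coupling described in the abstract (Joule heating raises temperature, and temperature in turn raises resistance and IR drop) can be illustrated with a single lumped element. The sketch below is a fixed-point iteration on one resistor and one thermal resistance, not the finite volume model of the paper; the linear resistance law, the function name, and every numeric value are assumptions chosen for illustration.

```python
def solve_electrothermal(current_a, r0_ohm, alpha_per_k, t_ref_c,
                         r_th_k_per_w, t_amb_c, tol=1e-9, max_iter=200):
    """Iterate R(T) -> Joule power -> T until the temperature stops changing.

    Returns (steady-state temperature in degC, IR drop in volts).
    """
    temp = t_amb_c
    for _ in range(max_iter):
        # Electrical step: resistance at the current temperature estimate.
        resistance = r0_ohm * (1.0 + alpha_per_k * (temp - t_ref_c))
        # Thermal step: Joule heating raises the temperature above ambient.
        power = current_a ** 2 * resistance
        new_temp = t_amb_c + r_th_k_per_w * power
        if abs(new_temp - temp) < tol:
            final_r = r0_ohm * (1.0 + alpha_per_k * (new_temp - t_ref_c))
            return new_temp, current_a * final_r
        temp = new_temp
    raise RuntimeError("electro-thermal iteration did not converge")


if __name__ == "__main__":
    # Illustrative values: 10 A through a 10 mOhm copper-like path.
    temp_c, ir_drop_v = solve_electrothermal(
        current_a=10.0, r0_ohm=0.01, alpha_per_k=0.0039, t_ref_c=25.0,
        r_th_k_per_w=20.0, t_amb_c=25.0)
    print(f"steady-state temperature: {temp_c:.2f} degC")
    print(f"IR drop: {ir_drop_v * 1000:.2f} mV")
```

Ignoring the feedback would give a 20 K rise for these numbers; including it gives a somewhat larger rise and a larger IR drop. A field solver applies the same reasoning cell by cell, solving the voltage and temperature equations together.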
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 32, no. 10, pp. 1769-1781. Journal Article.
Citations: 0
A 370-nW Bio-AFE With 2.9-μVrms Input Noise in an Octa-Channel System-in-Package for Multimode Bio-Signal Acquisition
IF 2.8 Zone 2 Engineering & Technology Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2024-07-29 DOI: 10.1109/tvlsi.2024.3430059
Patrick Fath, Harald Pretl
Citations: 0