IEEE Transactions on Very Large Scale Integration (VLSI) Systems: Latest Articles


IEEE Transactions on Very Large Scale Integration (VLSI) Systems Society Information

IF 3.1 Zone 2 (Engineering & Technology) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Transactions on Very Large Scale Integration (VLSI) Systems

Pub Date : 2026-02-25 DOI: 10.1109/TVLSI.2026.3660040

Citations: 0

FPGA-Based Low-Power Signed Approximate Multipliers for Diverse Error-Resilient Applications

Pub Date : 2026-01-23 DOI: 10.1109/TVLSI.2026.3654164

Yi Guo;Xuetao Li;Xin Luo;Heming Sun;Haroon Waris;Weiqiang Liu

The Booth algorithm is widely used for efficient signed multiplication because it reduces the number of partial products. A higher-radix Booth multiplier generates fewer partial products but increases the hardware complexity of the generator, which diminishes the advantage of needing fewer accumulators. Previous optimizations of generators and accumulators were designed for application-specific integrated circuits (ASICs), and their performance gains do not translate comparably to field-programmable gate arrays (FPGAs) because of architectural differences. This article proposes FPGA-friendly approximate Booth multipliers that combine approximate hybrid-radix partial product generation with resource-efficient accumulation techniques. First, to improve generation efficiency, a look-up table (LUT)-reused exact radix-8 generator is introduced; it uses logical partitioning to integrate two types of partial products into a single LUT. In addition, approximate adjacent-compensation radix-8 and radix-16 generators are developed based on the Booth encoding bit-repetition principle. Next, to speed up partial product accumulation, an overlap-parallel accumulation scheme and various accumulators are proposed, reducing compression steps and improving resource utilization. Finally, performance-configurable hybrid radix-8/-16 approximate Booth multipliers are designed to meet the needs of different error-resilient applications. The most hardware-efficient configuration of the proposed 16-bit multiplier reduces power–delay product (PDP) and LUT consumption by 38.31% and 35.66%, respectively, compared with the exact multiplier. Furthermore, the proposed designs offer a better balance between accuracy and hardware complexity than existing approximate multipliers. The practicality of these multipliers is demonstrated in both Joint Photographic Experts Group (JPEG) image compression and finite impulse response (FIR) filtering applications. An open-source library of the proposed multipliers is available at https://github.com/YnuGuoLab/FPGA_Signed_Approx_Mul to support further research.
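
The trade-off between radix and partial-product count described in this abstract can be illustrated with a small software model. The sketch below is an exact textbook Booth recoding, not the approximate generators proposed in the article; the function names `booth_digits` and `booth_multiply` are hypothetical and chosen only for illustration. For a 16-bit multiplier it yields 8, 6, and 4 partial products at radix 4, 8, and 16, respectively.

```python
def booth_digits(y, bits, k):
    """Recode a signed `bits`-wide integer y into radix-2**k Booth digits (least significant first)."""
    def bit(i):
        # Python's >> on negative ints sign-extends, so this reads two's-complement bits.
        return 0 if i < 0 else (y >> i) & 1

    digits = []
    for j in range(-(-bits // k)):          # ceil(bits / k) overlapping groups
        base = j * k
        d = bit(base - 1)                   # overlap bit shared with the previous group
        for t in range(k - 1):
            d += bit(base + t) << t         # positively weighted bits of the group
        d -= bit(base + k - 1) << (k - 1)   # top bit of the group carries negative weight
        digits.append(d)
    return digits


def booth_multiply(x, y, bits=16, k=3):
    """Exact product: one partial product (digit * x) per Booth digit, shifted into place."""
    return sum((d * x) << (k * j) for j, d in enumerate(booth_digits(y, bits, k)))


if __name__ == "__main__":
    for k, name in ((2, "radix-4"), (3, "radix-8"), (4, "radix-16")):
        print(name, len(booth_digits(-12345, 16, k)), "partial products")
    print(booth_multiply(-1234, 5678) == -1234 * 5678)
```

Radix-8 digits range over -4..4, so a generator must also produce the multiple 3x, which cannot be obtained by shifting alone. This is one common source of the extra generator complexity at higher radices.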

Volume 34, Issue 3, pp. 1029-1042

Citations: 0

IEEE Transactions on Very Large Scale Integration (VLSI) Systems Society Information

Pub Date : 2026-01-22 DOI: 10.1109/TVLSI.2026.3653075

Citations: 0

A Low-Cost Local Masking Radix-4 NTT Against Soft-Analytical Side-Channel Attacks

Pub Date : 2026-01-15 DOI: 10.1109/TVLSI.2026.3651779

Congwei Chen;Jinwei Pu;Jianxiong Zhang;Jiaying Liao;Ruidian Zhan;Fei Yu;Yun Chen;Shuting Cai

The number theoretic transform (NTT) is essential for accelerating polynomial multiplication in lattice-based cryptography. However, it is vulnerable to soft-analytical side-channel attacks (SASCAs). Although the local masking countermeasure provides theoretical resistance against such attacks, implementing it directly in a Radix-4 NTT architecture leads to a more than fourfold increase in modular multiplications, resulting in substantial hardware overhead. To address this challenge, we propose the modular multiplication parallel mask sharing (MMPMS) scheme, which optimizes the modular multiplication parallelism of the Radix-4 butterfly units and shares random twiddle factors, thereby balancing hardware overhead and security. We then construct a complete local masking NTT/INTT algorithm and implement it efficiently on an Artix-7 field-programmable gate array (FPGA). Experimental results show that, compared with the state-of-the-art local masking NTT, our scheme reduces the equivalent area and ATP overhead by more than 8.24 times and 6.74 times, respectively. In addition, a nonspecific t-test analysis indicates no significant side-channel leakage.
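
For readers unfamiliar with why the NTT accelerates polynomial multiplication, the following is a minimal sketch using toy parameters that are assumptions for illustration only (modulus 17, length 4, root of unity 4). It uses a direct O(n²) transform and a cyclic convolution; it shows neither the Radix-4 butterflies nor the masking scheme of the article, and lattice-based schemes typically use a negacyclic variant with much larger parameters.

```python
Q = 17      # toy prime modulus (assumption, far smaller than cryptographic sizes)
N = 4       # transform length
OMEGA = 4   # primitive 4th root of unity mod 17: 4**2 = 16 = -1, 4**4 = 1


def ntt(a, root):
    """Direct O(N^2) number theoretic transform of a with the given root of unity."""
    return [sum(a[j] * pow(root, i * j, Q) for j in range(N)) % Q for i in range(N)]


def intt(A):
    """Inverse transform: use the inverse root and scale by N^-1 mod Q."""
    inv_n = pow(N, -1, Q)          # modular inverse, Python 3.8+
    inv_root = pow(OMEGA, -1, Q)
    return [(x * inv_n) % Q for x in ntt(A, inv_root)]


def cyclic_mul(a, b):
    """Multiply polynomials mod (x^N - 1, Q) by pointwise products in the NTT domain."""
    A, B = ntt(a, OMEGA), ntt(b, OMEGA)
    return intt([(x * y) % Q for x, y in zip(A, B)])


def schoolbook_cyclic(a, b):
    """Reference cyclic convolution used to check the NTT-based result."""
    c = [0] * N
    for i in range(N):
        for j in range(N):
            c[(i + j) % N] = (c[(i + j) % N] + a[i] * b[j]) % Q
    return c


if __name__ == "__main__":
    a, b = [1, 2, 3, 4], [5, 6, 7, 8]
    print(cyclic_mul(a, b) == schoolbook_cyclic(a, b))
```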

Volume 34, Issue 3, pp. 1062-1066

Citations: 0

Self-Calibrating Analog Circuitry for Softmax-Scaled Function With Analog Computing-In-Memory

Pub Date : 2026-01-13 DOI: 10.1109/TVLSI.2026.3651307

Linjun Jiang;Yitong Zhou;He Zhang;Wang Kang

Analog computing-in-memory (ACIM) has garnered widespread attention due to its high energy efficiency. However, it incurs large power and hardware costs when handling sophisticated nonlinear functions such as softmax, because exponentiation and division are expensive. Existing digital-domain approaches often rely on dedicated modules to carry out these operations, leading to large area and high power consumption. To address these issues, we propose a self-calibrating analog circuitry for a softmax-scaled function with ACIM. By exploiting transistor subthreshold properties, the work eliminates expensive digital operations while mapping exponentiation and division to successive analog circuits. A self-calibration module further mitigates partial mismatch-induced deviations by dynamically tuning bias voltages, improving overall fitting accuracy and system robustness. The proposed softmax-enabled ACIM work achieves an energy efficiency of 55.06–60.08 TOPS/W and an area efficiency of 684.15 GOPS/mm² at 4-bit precision. Compared with state-of-the-art ACIMs with softmax implementations, the proposed work shows higher energy efficiency and area efficiency.
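
The two operations the abstract singles out as costly, exponentiation and division, are visible in the standard floating-point definition of softmax. The sketch below is only a software reference for that definition, not a model of the analog circuit; subtracting the maximum is a common numerical-stability convention in software and is an assumption here, not something stated in the abstract.

```python
import math


def softmax(x):
    """Reference softmax: exponentiate each input, then divide by the sum."""
    m = max(x)                                # shift for numerical stability (software convention)
    exps = [math.exp(v - m) for v in x]       # exponentiation step
    total = sum(exps)
    return [e / total for e in exps]          # division (normalization) step


if __name__ == "__main__":
    probs = softmax([1.0, 2.0, 3.0])
    print(probs, sum(probs))
```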

Volume 34, Issue 3, pp. 1067-1071

Citations: 0

Fault-Tolerant IDMA Multiuser Detector Based on Fault Injection Analysis of Internal Memories

Pub Date : 2025-12-29 DOI: 10.1109/TVLSI.2025.3646357

Byeong Yong Kong

In this article, a fault-tolerant architecture (FTA) is presented for the multiuser detector in interleave division multiple access (IDMA). The detector is inherently prone to soft errors, as its chip area is predominantly occupied by memories, which are easily exposed to high-energy particles and hostile interference in harsh environments. One of the most widespread ways to protect memories is to encode their entries with error-correcting codes (ECCs). However, naïvely encoding all bits in an entry is likely to be costly and unnecessary. Accordingly, to identify performance-critical bits and determine the priority of protection, we extensively scrutinize how vulnerable the respective bits in the memories of the detector are to soft errors. Based on this analysis, an efficient FTA is developed that selectively encodes only a subset of the bits, in order of the identified vulnerability. Furthermore, the proposed FTA implements the state-of-the-art multiuser detection (MUD) scheme called on-the-fly despreading (OD) and showcases a new feature named purification, which repeatedly replaces erroneous entries with corrected ones to keep them error-free. The complicated memory accesses needed to perform OD and purification concurrently are enabled by remodeling both the datapath and the control path of the baseline OD architecture (ODA). Implementation results demonstrate that, unlike prior art, which fails to sustain near-optimal performance and becomes impractical even at a very low probability of soft error, the proposed FTA can operate robustly in a wide range of harsh conditions without incurring much overhead.
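
The idea of encoding only a subset of the bits in a memory entry can be sketched as follows. This is a hypothetical illustration, not the article's architecture: it assumes an 8-bit entry whose upper four bits are the critical ones, and it protects them with a textbook Hamming(7,4) single-error-correcting code while leaving the lower four bits unprotected. Which bits are actually critical, and which ECC is used, are determined by the article's fault injection analysis and are not given in the abstract.

```python
def hamming74_encode(nibble):
    """Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7 = p1 p2 d0 p3 d1 d2 d3)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    bits = [p1, p2, d[0], p3, d[1], d[2], d[3]]
    return sum(b << i for i, b in enumerate(bits))


def hamming74_decode(code):
    """Correct up to one flipped bit and return the 4 data bits."""
    bits = [(code >> i) & 1 for i in range(7)]
    s1 = bits[0] ^ bits[2] ^ bits[4] ^ bits[6]
    s2 = bits[1] ^ bits[2] ^ bits[5] ^ bits[6]
    s3 = bits[3] ^ bits[4] ^ bits[5] ^ bits[6]
    syndrome = s1 | (s2 << 1) | (s3 << 2)     # 1-based position of the error, 0 if none
    if syndrome:
        bits[syndrome - 1] ^= 1
    data = [bits[2], bits[4], bits[5], bits[6]]
    return sum(b << i for i, b in enumerate(data))


def protect_entry(entry):
    """Store an 8-bit entry as 11 bits: coded upper nibble plus raw lower nibble."""
    return (hamming74_encode(entry >> 4) << 4) | (entry & 0xF)


def recover_entry(stored):
    """Rebuild the 8-bit entry, correcting a single upset in the protected part."""
    return (hamming74_decode(stored >> 4) << 4) | (stored & 0xF)


if __name__ == "__main__":
    stored = protect_entry(0xA7)
    upset = stored ^ (1 << 9)                 # flip one bit inside the protected region
    print(hex(recover_entry(upset)))
```

In this toy layout, an upset in the unprotected lower nibble changes the recovered value by at most 15, whereas an upset anywhere in the coded region is corrected. The storage cost is 11 bits instead of the 14 needed if both nibbles were encoded the same way.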

Volume 34, Issue 3, pp. 1004-1016

Citations: 0

A Low Area Built-In Self-Repair Using Hybrid Fault Address Memory for HBM

Pub Date : 2025-12-29 DOI: 10.1109/TVLSI.2025.3646232

Byungsoo Kim;Seung Ho Shin;Youngki Moon;Eugene Jeong;Sungho Kang

The massive computational requirements of large language models (LLMs) have increased the need for high-bandwidth memory (HBM), which involves high-volume data transfers. The high cell capacity of HBM results in extended test and repair times, leading to increased manufacturing costs. To reduce test time, a built-in self-repair (BISR) circuit, integrated into the HBM base die to detect and repair faults, tests multiple banks in parallel. Conventional BISR approaches adopt content-addressable memory (CAM) for fault classification to reduce repair time. However, a dedicated CAM on each bank leads to substantial area overhead associated with its comparison logic. To address these issues, this article proposes a novel BISR architecture that decouples fault classification from fault storage. The architecture introduces a low-area linked CAM that is shared across banks for fault classification, while small-area first-in first-out (FIFO) memories allocated to each bank store the classified fault information, which substantially reduces overall area overhead. Furthermore, the proposed architecture reorders the repair solution search sequence toward the most promising candidates by swapping fault entries during test idle periods, thereby significantly reducing repair time. Experimental results demonstrate that the proposed BISR architecture achieves low area overhead and fast repair time for high-density HBM.

Volume 34, Issue 3, pp. 991-1003

Citations: 0

IEEE Transactions on Very Large Scale Integration (VLSI) Systems Society Information

Pub Date : 2025-12-29 DOI: 10.1109/TVLSI.2025.3641351

Citations: 0

A Capacitor Discharge-Based SRAM CIM Macro Based on Hybrid-Domain for Convolutional Neural Networks

Pub Date : 2025-12-29 DOI: 10.1109/TVLSI.2025.3646936

Bin Qiang;Yiming Wei;Yongliang Zhou;Xiulong Wu;Chunyu Peng

Compute-in-memory (CIM) is increasingly recognized as an effective hardware accelerator for convolutional neural networks (CNNs). This work proposes a hybrid-domain CIM design with three contributions: 1) a multibit compute unit (MBCU) structure that realizes the multiplication of a 2-bit input and a 4-bit weight through transistor-size-weighted capacitor discharge on the bitline; 2) a hybrid-domain quantization scheme (HDQS) of “time-domain + voltage-domain,” which combines the high energy efficiency of time-domain quantization with the low delay of voltage-domain quantization and enhances quantization accuracy through the combined effect of a process tracking module and a reference signal module; and 3) the circuit design, layout, and simulation verification of the hybrid-domain static random access memory (SRAM) CIM macro in 28-nm CMOS technology. Results show that the circuit supports 8-bit multiply–accumulate (MAC) operation, and that full-precision quantization in the hybrid-domain form achieves an optimal energy efficiency of 249.7 TOPS/W per bit at 0.7 V and an area efficiency of 4.29 TOPS/mm² per bit. Furthermore, integrating the circuits with the VGG-16 network yields an inference accuracy of 90.52% on the CIFAR-10 dataset.
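
One way a 2-bit-input by 4-bit-weight primitive can be composed into an 8-bit MAC is bit slicing with shift-and-add recombination. The sketch below illustrates that arithmetic under stated assumptions: operands are unsigned, and the slicing order and the helper names `slices` and `mac_8bit` are hypothetical. The abstract does not describe how the macro actually combines its primitive products, so this is not a model of the circuit.

```python
def slices(value, width, slice_bits):
    """Split an unsigned `width`-bit value into slices of `slice_bits`, least significant first."""
    mask = (1 << slice_bits) - 1
    return [(value >> s) & mask for s in range(0, width, slice_bits)]


def mac_8bit(inputs, weights):
    """8-bit x 8-bit multiply-accumulate built only from 2-bit x 4-bit products."""
    acc = 0
    for x, w in zip(inputs, weights):
        for i, xs in enumerate(slices(x, 8, 2)):        # four 2-bit input slices
            for j, ws in enumerate(slices(w, 8, 4)):    # two 4-bit weight slices
                acc += (xs * ws) << (2 * i + 4 * j)     # shift each product to its weight
    return acc


if __name__ == "__main__":
    xs, ws = [200, 17, 255], [3, 129, 255]
    print(mac_8bit(xs, ws) == sum(x * w for x, w in zip(xs, ws)))
```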

Volume 34, Issue 3, pp. 1043-1047

Citations: 0

DN-FF: A SEU-Tolerant Flip-Flop Design for Advanced Technology Nodes

Pub Date : 2025-12-23 DOI: 10.1109/TVLSI.2025.3642611

Lowry P.-T. Wang;Charles H.-P. Wen

Single-event upsets (SEUs) pose a critical reliability threat in advanced automotive and space electronics. Existing SEU-tolerant latch designs, such as those based on C-elements and unique modules, often fail to meet the stringent space radiation standards (linear energy transfer (LET) $= 60~\text{MeV} \cdot \text{cm}^{2}/\text{mg}$) at advanced fin field-effect transistor (FinFET) technology nodes, whereas triple modular redundancy (TMR) achieves sufficient tolerance but incurs significant overhead. To address these limitations, this brief introduces DN-FF, a novel detection-node flip-flop architecture that leverages the reduced node spacing in modern processes to achieve complete SEU immunity with significantly lower overhead than TMR. By incorporating strategically placed detection nodes (DNs) and a dedicated detection circuit (DC), DN-FF achieves robust radiation hardness while significantly reducing physical area, delay, and power consumption compared to traditional TMR-based solutions. The experimental results demonstrate that DN-FF reduces area by 8.2%, delay by 18.4%, and power by 15.9%, delivering a 37% improvement in the overall area–delay–power quality (ADPQ) metric. These advantages make DN-FF a compact, high-performance, and reliable solution for demanding automotive and aerospace applications.
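
The reported 37% figure is consistent with the three individual reductions if ADPQ is taken to be the product of area, delay, and power; that definition is an assumption here, since the abstract does not define the metric. The sketch below checks the arithmetic and, for context, includes the bitwise majority vote that a TMR baseline relies on. The function names are hypothetical.

```python
def adpq_improvement(area_red, delay_red, power_red):
    """Relative improvement of an area * delay * power product given fractional reductions."""
    remaining = (1 - area_red) * (1 - delay_red) * (1 - power_red)
    return 1 - remaining


def tmr_vote(a, b, c):
    """Bitwise majority of three redundant copies, as used by a TMR voter."""
    return (a & b) | (b & c) | (a & c)


if __name__ == "__main__":
    # 0.918 * 0.816 * 0.841 is about 0.630, i.e. roughly a 37% improvement.
    print(f"{adpq_improvement(0.082, 0.184, 0.159):.1%}")
    # One upset copy is outvoted by the two intact ones.
    print(tmr_vote(0b1011, 0b1011, 0b0011) == 0b1011)
```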

Volume 34, Issue 3, pp. 1048-1052

Citations: 0
