Pub Date : 2025-07-24 DOI: 10.1109/TCAD.2025.3592608
Baoze Zhao;Conghui Luo;Wenjin Huang;Yihua Huang
Zero-knowledge proof (ZKP) has gained widespread application across various domains, demonstrating remarkable success. Among ZKP algorithms, the zero-knowledge succinct noninteractive argument of knowledge (zk-SNARK) is the most widely used. However, despite its advantages of small proof size and succinct verification, zk-SNARK proof generation faces significant challenges due to high computational demands, limiting its practical application. This article addresses these challenges by accelerating two computationally intensive operations in zk-SNARK proof generation, the number theoretic transform (NTT) and multiscalar multiplication (MSM), using FPGAs. In NTT hardware accelerators for zk-SNARK applications, the traditional 4-step algorithm often encounters conflicts between off-chip bandwidth and on-chip memory. To resolve this issue, we propose an innovative approach that enhances accelerator performance by recursively applying the 4-step algorithm to create a more efficient 6-step algorithm. For MSM hardware acceleration on FPGAs, existing works are often constrained by limited on-chip memory, restricting the use of longer slice lengths, which are crucial for higher performance with the commonly used Pippenger algorithm. To overcome this limitation, we introduce the Batch Method, which optimizes off-chip memory consumption, enabling the accelerator to use longer slice lengths and achieve superior performance. Experimental results demonstrate that the proposed NTT design achieves $1.76\times$ higher DSP efficiency than SAM. Meanwhile, the proposed MSM design demonstrates $1.24\times$ higher performance than MSMAC with aligned frequency and number of PEs. When benchmarked against the GPU implementation GZKP, our MSM design exhibits $1.16\times$ and $1.46\times$ higher performance than GZKP for BLS12-381 and BN-254, respectively. However, the NTT design remains at a disadvantage due to the bandwidth gap between our platform, the Xilinx Alveo U250, and GZKP's platforms, the Nvidia GTX 1080 Ti and Nvidia Tesla V100.
{"title":"FPGA-Based Hardware Accelerator of zk-SNARK","authors":"Baoze Zhao;Conghui Luo;Wenjin Huang;Yihua Huang","doi":"10.1109/TCAD.2025.3592608","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3592608","url":null,"abstract":"zero-knowledge proof (ZKP) has gained widespread application across various domains, demonstrating remarkable success. Among ZKP algorithms, zero-knowledge succinct noninteractive argument of knowledge (zk-SNARK) is the most widely used. However, despite its advantages of small proof size and succinct verification, zk-SNARK proof generation faces significant challenges due to high computational demands, limiting its practical application. This article addresses these challenges by accelerating two computationally intensive operations in zk-SNARK proof generation, number theory transformation (NTT) and multiscalar multiplication (MSM), using FPGAs. In the implementation of NTT hardware accelerators for zk-SNARK applications, the traditional 4-step algorithm often encounters conflicts between off-chip bandwidth and on-chip memory. To resolve this issue, we propose an innovative approach that enhances accelerator performance by recursively applying the 4-step algorithm to create a more efficient 6-step algorithm. For MSM hardware acceleration on FPGAs, existing works are often constrained by limited on-chip memory, restricting the use of longer slice lengths, which are crucial for higher performance when using the commenly used Pippenger algorithm. To overcome this limitation, we introduce the Batch Method, optimizing off-chip memory consumption, enabling the accelerator to use longer slice lengths and achieve superior performance. Experimental results demonstrate that the proposed NTT design achieves <inline-formula> <tex-math>$1.76times $ </tex-math></inline-formula> higher DSP efficiency than the SAM. Meanwhile, the proposed MSM design demonstrates <inline-formula> <tex-math>$1.24times $ </tex-math></inline-formula> higher performance than the MSMAC with aligned frequency and number of PEs. When benchmarked against the GPU implementation GZKP, our MSM design exhibits <inline-formula> <tex-math>$1.16times $ </tex-math></inline-formula> and <inline-formula> <tex-math>$1.46times $ </tex-math></inline-formula> higher performance than GZKP for BLS12-381 and BN-254, respectively. However, the NTT design remains at a disadvantage due to the bandwidth limitation between our platform, Xilinx Alveo U250, and GZKP’s platforms, Nvidia GTX 1080 Ti and Nvidia Tesla V100.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"45 2","pages":"887-900"},"PeriodicalIF":2.9,"publicationDate":"2025-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146006874","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-07-24 DOI: 10.1109/TCAD.2025.3592605
Jinyang Li;Samudra Dasgupta;Yuhong Song;Lei Yang;Travis Humble;Weiwen Jiang
Quantum computing has significantly advanced in recent years, boasting devices with hundreds of quantum bits (qubits), hinting at its potential quantum advantage over classical computing. Yet, noise in quantum devices poses significant barriers to realizing this supremacy. Understanding noise's impact is crucial for reproducibility and application reuse; moreover, next-generation quantum-centric supercomputing essentially requires efficient and accurate noise characterization to support system management (e.g., job scheduling), where ensuring correct functional performance (i.e., fidelity) of jobs on available quantum devices can be even higher-priority than traditional objectives. However, noise fluctuates over time, even on the same quantum device, which makes predicting computational bounds under on-the-fly noise vital. Noisy quantum simulation can offer insights but faces efficiency and scalability issues. In this work, we propose a data-driven workflow, namely QuBound, to predict computational performance bounds. It decomposes historical performance traces to isolate noise sources and devises a novel encoder to embed circuit and noise information processed by a long short-term memory (LSTM) network. For evaluation, we compare QuBound with a state-of-the-art learning-based predictor, which only generates a single performance value instead of a bound. Experimental results show that the existing approach's predictions fall outside the performance bounds, while all predictions from QuBound, with the assistance of performance decomposition, fit within them. Moreover, QuBound can efficiently produce practical bounds for various circuits with over $10^{6}\times$ speedup over simulation; in addition, the range from QuBound is over $10\times$ narrower than that of the state-of-the-art analytical approach.
{"title":"Computational Performance Bounds Prediction in Quantum Computing With Unstable Noise","authors":"Jinyang Li;Samudra Dasgupta;Yuhong Song;Lei Yang;Travis Humble;Weiwen Jiang","doi":"10.1109/TCAD.2025.3592605","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3592605","url":null,"abstract":"Quantum computing has significantly advanced in recent years, boasting devices with hundreds of quantum bits (qubits), hinting at its potential quantum advantage over classical computing. Yet, noise in quantum devices poses significant barriers to realizing this supremacy. Understanding noise’s impact is crucial for reproducibility and application reuse; moreover, the next-generation quantum-centric supercomputing essentially requires efficient and accurate noise characterization to support system management (e.g., job scheduling), where ensuring correct functional performance (i.e., fidelity) of jobs on available quantum devices can even be higher-priority than traditional objectives. However, noise fluctuates over time, even on the same quantum device, which makes predicting the computational bounds for on-the-fly noise is vital. Noisy quantum simulation can offer insights but faces efficiency and scalability issues. In this work, we propose a data-driven workflow, namely, QuBound, to predict computational performance bounds. It decomposes historical performance traces to isolate noise sources and devises a novel encoder to embed circuit and noise information processed by a long short-term memory (LSTM) network. For evaluation, we compare QuBound with a state-of-the-art learning-based predictor, which only generates a single performance value instead of a bound. Experimental results show that the result of the existing approach falls outside of performance bounds, while all predictions from our QuBound with the assistance of performance decomposition better fit the bounds. Moreover, QuBound can efficiently produce practical bounds for various circuits with over <inline-formula> <tex-math>$10^{6}$ </tex-math></inline-formula> speedup over simulation; in addition, the range from QuBound is over <inline-formula> <tex-math>$10times $ </tex-math></inline-formula> narrower than the state-of-the-art analytical approach.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"45 2","pages":"969-982"},"PeriodicalIF":2.9,"publicationDate":"2025-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146006937","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-07-24 DOI: 10.1109/TCAD.2025.3592590
Saeed Aghapour;Kasra Ahmadi;Mehran Mozaffari Kermani;Reza Azarderakhsh
Detection of soft errors and faults is one of the most critical factors in ensuring the reliability of algorithm implementations. Multiplication, as a fundamental and computationally intensive operation, is particularly vulnerable to such errors. Given its widespread use in cryptography and coding applications, detecting these errors is crucial. For example, in hash functions, even a single-bit change in the input can completely alter the output (ideally, each bit of the output changes with a probability of $1/2$). Montgomery multiplication, an efficient modular multiplication method, is an integral part of numerous cryptographic applications spanning both classical and post-quantum cryptography. For that reason, this brief introduces a fault detection method for the multiple-precision Montgomery modular multiplication algorithm based on partial recomputation. Through extensive simulations and implementations, we demonstrate that our approach efficiently detects both permanent and transient errors with a high success rate, while imposing modest area and time overhead on the system.
{"title":"Partial Recomputation Fault Detection Architecture for Multiple-Precision Montgomery Modular Multiplication","authors":"Saeed Aghapour;Kasra Ahmadi;Mehran Mozaffari Kermani;Reza Azarderakhsh","doi":"10.1109/TCAD.2025.3592590","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3592590","url":null,"abstract":"Detection of soft errors and faults are one of the most critical factors in ensuring the reliability of algorithm implementations. Multiplication, as a fundamental and computationally intensive operation, is particularly vulnerable to such errors. Given its widespread use in cryptography and coding applications, detecting these errors is crucial. For example, in hash functions, even a single-bit change in the input can completely alter the output ([ideally, each bit of the output changes with a probability of <inline-formula> <tex-math>${}({1}/{2}])$ </tex-math></inline-formula>. Montgomery multiplication as an efficient multiplication method is an integral part of numerous cryptographic applications expanding both classical and post quantum cryptography. For that reason, this brief introduces a fault detection method for the multiple-precision Montgomery modular multiplication algorithm based on partial recomputation. Through extensive simulations and implementations, we demonstrate that our approach efficiently detects both permanent and transient errors with a high-success rate, while imposing modest area and time overhead on the system.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"45 2","pages":"1042-1046"},"PeriodicalIF":2.9,"publicationDate":"2025-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146006859","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-07-24 DOI: 10.1109/TCAD.2025.3592587
Pingdan Xiao;Yiliu Gu;Haoyou Jiang;Zhen Huan;Sichun Du;Qinghui Hong
Model predictive control (MPC), a receding-horizon optimal control strategy, predicts system dynamics and optimizes control actions to satisfy performance and constraint requirements, making it widely adopted in control engineering. However, contemporary computing platforms struggle to meet the real-time and energy-efficiency demands of MPC's computationally intensive matrix operations, owing to high data movement overhead, extensive circuit resource utilization, and frequent data conversions inherent in physical system interfaces. These challenges collectively impose significant latency and power penalties, which become particularly critical as systems grow in complexity and scale in the big-data era. This article introduces a zeroing neural network (ZNN)-based memristive neural network circuit that directly converges the MPC error function to zero in one step. Theoretical analysis and simulations validate the closed-loop circuit's stability. For a 32-step prediction horizon, evaluations show that the control output from the proposed circuit matches the ideal digital MPC solution with 96.0% accuracy. The circuit also executes at least an order of magnitude faster and consumes less energy than traditional MPC solvers. Additionally, the circuit successfully accelerates the proposed trajectory tracking algorithm, achieving 98.0% accuracy relative to the theoretical result and a $318.2\times$ improvement in computation time over a CPU.
{"title":"Memristive Neural Network Circuit Implementation of Model Predictive Control for Trajectory Tracking","authors":"Pingdan Xiao;Yiliu Gu;Haoyou Jiang;Zhen Huan;Sichun Du;Qinghui Hong","doi":"10.1109/TCAD.2025.3592587","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3592587","url":null,"abstract":"model predictive control (MPC), a receding-horizon optimal control strategy, predicts system dynamics and optimizes control actions to satisfy performance and constraint requirements, making it widely adopted in control engineering. However, contemporary computing platforms struggle to meet the real-time and energy-efficient demands of MPC’s computationally intensive matrix operations, stemming from high data movement overhead, extensive circuit resource utilization, and frequent data conversions inherent in physical system interfaces. These challenges collectively impose significant latency and power penalties, particularly critical as systems grow in complexity and scale within the big-data era. This article introduces a zeroing neural network (ZNN)-based memristive neural network circuit that directly converges the MPC error function to zero in one step. Theoretical analysis and simulations validate the closed-loop circuit’s stability. For a 32-step prediction horizon, evaluations show that the control output from the proposed circuit matches the ideal digital MPC solution with 96.0% accuracy. The circuit also executes at least an order of magnitude faster and consumes less energy than traditional MPC solvers. Additionally, the circuit successfully accelerates the proposed trajectory tracking algorithm, achieving 98.0% accuracy compared with the theoretical result and <inline-formula> <tex-math>$318.2times $ </tex-math></inline-formula> improvement in computation time compared to CPU.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"45 2","pages":"955-968"},"PeriodicalIF":2.9,"publicationDate":"2025-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146006931","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-07-22 DOI: 10.1109/TCAD.2025.3591734
Botao Xiong;Shize Zhang;Xingyu Shao;Xintong He;Yuchun Chang
A field-programmable gate array (FPGA)-friendly processing element (PE) based on the small logarithmic floating point (SLFP) format is proposed. The proposed PEs not only support the inner product but also perform various nonlinear activation functions (NAFs); they consume 674 LUT6s and 7 digital signal processing blocks (DSPs) and operate at 450 MHz in a pipelined manner on the Zynq-7000. In addition, as the distribution of SLFP numbers is not uniform, this brief revises the weight decay scheme in the quantization-aware training process to explore the optimum quantized weights. Compared with an INT8-based design, the proposed method balances resource usage between look-up tables and DSPs. The accuracy loss of the quantized model based on 8-bit SLFP is also small due to the high dynamic range of the SLFP format. Moreover, since the proposed method supports different NAFs, this brief improves quantized model accuracy by selecting an appropriate NAF from Swish, GELU, Mish, and PReLU. Compared to the baseline (FP32 parameters, ReLU as the NAF), the accuracy of quantized ResNet-50 and MobileNet changes by +2.65% and −0.33%, respectively.
{"title":"FPGA-Friendly Architecture of Processing Elements for Efficient and Accurate Quantized CNNs","authors":"Botao Xiong;Shize Zhang;Xingyu Shao;Xintong He;Yuchun Chang","doi":"10.1109/TCAD.2025.3591734","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3591734","url":null,"abstract":"An field-programmable gate array (FPGA)-friendly processing element (PE) based on the small logarithmic floating point (SLFP) format is proposed. The proposed PEs not only support inner product but also perform various nonlinear activation functions (NAFs), which consume <inline-formula> <tex-math>$674 times $ </tex-math></inline-formula> LUT6s and <inline-formula> <tex-math>$7 times $ </tex-math></inline-formula> digital signal processing blocks (DSPs) and operate at 450MHz in a pipeline manner for Zynq-7000. In addition, as the distribution of SLFP numbers is not uniform, this brief revises the weight decay scheme in the quantization aware training process to explore the optimum quantized weights. Compared with INT8-based design, the proposed method balances the resource usage between look-up tables and DSPs. The accuracy loss of the quantized model based on the 8-bit SLFP is also small due to the high dynamic range of SLFP format. Moreover, since the proposed method can support different NAFs, this brief improves the quantized model accuracy by selecting an appropriate NAF from Swish, GELU, Mish and PReLU. Compared to the baseline (parameters are FP32, NAF is ReLU), the accuracy of quantized ResNet-50 and MobileNet is increased by 2.65% and −0.33%.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"45 2","pages":"882-886"},"PeriodicalIF":2.9,"publicationDate":"2025-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146006932","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-07-21 DOI: 10.1109/TCAD.2025.3591409
Yifan Qin;Zheyu Yan;Dailin Gan;Jun Xia;Zixuan Pan;Wujie Wen;Xiaobo Sharon Hu;Yiyu Shi
Compute-in-memory accelerators built upon nonvolatile memory devices excel in energy efficiency and latency when performing deep neural network (DNN) inference, thanks to their in-situ data processing capability. However, the stochastic nature and intrinsic variations of nonvolatile memory devices often result in performance degradation during DNN inference. Introducing these nonideal device behaviors in DNN training enhances robustness, but drawbacks include limited accuracy improvement, reduced prediction confidence, and convergence issues. This arises from a mismatch between the deterministic training and nondeterministic device variations, as such training, though considering variations, relies solely on the model’s final output. In this work, inspired by control theory, we propose negative feedback training (NeFT)—a novel concept supported by theoretical analysis—to more effectively capture the multiscale noisy information throughout the network. We instantiate this concept with two specific instances, oriented variational forward (OVF) and intermediate representation snapshot (IRS). Based on device variation models extracted from measured data, extensive experiments show that our NeFT outperforms existing state-of-the-art methods with up to a 45.08% improvement in inference accuracy while reducing epistemic uncertainty, boosting output confidence, and improving convergence probability. These results underline the generality and practicality of our NeFT framework for increasing the robustness of DNNs against device variations. The source code for these two instances is available at https://github.com/YifanQin-ND/NeFT_CIM.
{"title":"NeFT: Negative Feedback Training to Improve Robustness of Compute-in-Memory DNN Accelerators","authors":"Yifan Qin;Zheyu Yan;Dailin Gan;Jun Xia;Zixuan Pan;Wujie Wen;Xiaobo Sharon Hu;Yiyu Shi","doi":"10.1109/TCAD.2025.3591409","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3591409","url":null,"abstract":"Compute-in-memory accelerators built upon nonvolatile memory devices excel in energy efficiency and latency when performing deep neural network (DNN) inference, thanks to their in-situ data processing capability. However, the stochastic nature and intrinsic variations of nonvolatile memory devices often result in performance degradation during DNN inference. Introducing these nonideal device behaviors in DNN training enhances robustness, but drawbacks include limited accuracy improvement, reduced prediction confidence, and convergence issues. This arises from a mismatch between the deterministic training and nondeterministic device variations, as such training, though considering variations, relies solely on the model’s final output. In this work, inspired by control theory, we propose negative feedback training (NeFT)—a novel concept supported by theoretical analysis—to more effectively capture the multiscale noisy information throughout the network. We instantiate this concept with two specific instances, oriented variational forward (OVF) and intermediate representation snapshot (IRS). Based on device variation models extracted from measured data, extensive experiments show that our NeFT outperforms existing state-of-the-art methods with up to a 45.08% improvement in inference accuracy while reducing epistemic uncertainty, boosting output confidence, and improving convergence probability. These results underline the generality and practicality of our NeFT framework for increasing the robustness of DNNs against device variations. The source code for these two instances is available at <uri>https://github.com/YifanQin-ND/NeFT_CIM</uri>.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"45 2","pages":"983-997"},"PeriodicalIF":2.9,"publicationDate":"2025-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146006862","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-07-21 DOI: 10.1109/TCAD.2025.3591411
David Coenen;Herman Oprins
Multiscale thermal analysis of integrated circuits is required to capture both device-level and package-level dynamics. Traditional analysis with the finite element (FE) method performs poorly on multiscale tasks because of conflicting element size requirements and CPU-time limitations. Machine learning (ML) algorithms can be trained with FE simulation data to perform fast and efficient temperature prediction. In this work, spatial and temporal aspects of the temperature field are treated independently and used to train two artificial neural networks (ANNs). Prior to ANN training, fundamental spatial modes are calculated via proper orthogonal decomposition (POD) to simplify the ANN structure. In the time domain, a similar approach is used: the fundamental temporal modes, i.e., thermal step responses, are calculated and used to train the ANN. By training the ANN on step response data, the final dynamic temperature profile can be reconstructed using the convolution operator. Using this method, a physics-informed ML workflow is established, as the step response is converted to the impulse response, or Green's function, which is a known part of the analytical solution to the heat equation. The final result is an extremely fast and accurate dynamic thermal model of a chip.
{"title":"PINDAS: Physics-Informed Decoupled Spatiotemporal Artificial Neural Network for Dynamic Thermal Simulation","authors":"David Coenen;Herman Oprins","doi":"10.1109/TCAD.2025.3591411","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3591411","url":null,"abstract":"Multiscale thermal analysis in integrated circuits is required for capturing both device-level and package-level dynamics. Traditional analysis with the finite element (FE) method performs poor at multiscale tasks because of conflicting element size requirements and CPU time limitation. Machine learning (ML) algorithms can be trained with FE simulation data to perform fast and efficient temperature prediction. In this work, spatial and temporal aspects of the temperature field are treated independently and used to train two artificial neural networks (ANNs). Prior to ANN training, fundamental spatial modes [proper orthogonal decomposition (POD)] are calculated to simplify the ANN structure. In the time domain, a similar approach is used: the fundamental temporal modes, i.e., thermal step responses, are calculated and used to train the ANN. By training the ANN on step response data, the final dynamic temperature profile can be reconstructed using the convolutional operator. Using this method, a physics-informed ML workflow is established as the step response is converted to the impulse response or Green’s function, which are a known part of the analytical solution to the heat equation. The final result is an extremely fast and accurate dynamic thermal model of a chip.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"45 2","pages":"998-1006"},"PeriodicalIF":2.9,"publicationDate":"2025-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146006866","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-07-21 DOI: 10.1109/TCAD.2025.3591410
Zehao Chen;Kai Zhang;Qian Wei;Nan Su;Yuhao Zhang;Zhaoyan Shen;Dongxiao Yu;Lei Ju
Log-structured merge (LSM) tree-based key-value (KV) stores organize writes into hierarchical batches to optimize write performance. However, the notorious compaction process and multilevel query mechanism of the LSM-tree severely hurt system performance. Our preliminary experiments show that 1) when compaction occurs in $L_{0}$ and $L_{1}$ of the LSM-tree, it may saturate system computation and memory resources, ultimately causing the entire system to stall and 2) a large number of iterative retrievals across multiple levels is usually required to locate the queried data, while redundant key range overlap in $L_{0}$ further increases the overhead. Based on these observations, we introduce Re-LSM+, a resistive random-access memory (ReRAM)-based processing-in-memory (PIM) framework for LSM-based KV stores. In Re-LSM+, we offload compaction tasks from the higher levels of the LSM-tree to the PIM processing part. A highly parallel ReRAM compaction accelerator is designed by breaking down the three-phase compaction process into basic logic operations. Additionally, we design an index table and a multilayer Bloom filter for different levels to improve the query efficiency of the LSM-tree. Evaluation results from db_bench show that Re-LSM+ achieves a $2.37\times$ improvement in random write throughput compared to RocksDB. Furthermore, the ReRAM-based compaction accelerator achieves a $68.16\times$ speedup over the CPU-based implementation and reduces energy consumption by $25.5\times$.
{"title":"A ReRAM-Based Processing-in-Memory Framework for LSM-Based Key-Value Store","authors":"Zehao Chen;Kai Zhang;Qian Wei;Nan Su;Yuhao Zhang;Zhaoyan Shen;Dongxiao Yu;Lei Ju","doi":"10.1109/TCAD.2025.3591410","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3591410","url":null,"abstract":"log-structured merge (LSM) tree-based key-value (KV) stores organize writes into hierarchical batches to optimize write performance. However, the notorious compaction process and multilevel query mechanism of LSM-tree severely hurt system performance. Our preliminary experiments show that 1) When compaction occurs in the <inline-formula> <tex-math>$L_{0}$ </tex-math></inline-formula> and <inline-formula> <tex-math>$L_{1}$ </tex-math></inline-formula> of the LSM-tree, it may saturate system computation and memory resources, ultimately causing the entire system to stall and 2) large number of iterative retrievals across multiple levels is usually required to locate the queried data, while redundant key range overlap in <inline-formula> <tex-math>$L_{0}$ </tex-math></inline-formula> further increases the overhead. Based on these observations, we introduce Re-LSM+, a resistive random-access memory (ReRAM)-based Processing-in-Memory framework for LSM-based KV Stores. In Re-LSM+, we offload compaction tasks from the higher levels of the LSM-tree to the PIM processing part. A highly parallel ReRAM compaction accelerator is designed by breaking down the three-phase compaction process into basic logic operations. Additionally, we design an index table and a multilayer Bloom filter for different levels to improve the query efficiency of the LSM-tree. Evaluation results from db_bench show that Re-LSM+ achieves a <inline-formula> <tex-math>$2.37times $ </tex-math></inline-formula> improvement in random write throughput compared to RocksDB. Furthermore, the ReRAM-based compaction accelerator achieves a <inline-formula> <tex-math>$68.16times $ </tex-math></inline-formula> speedup over the CPU-based implementation and reduces energy consumption to <inline-formula> <tex-math>$25.5times $ </tex-math></inline-formula>.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"45 2","pages":"1061-1074"},"PeriodicalIF":2.9,"publicationDate":"2025-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146006883","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-07-18 DOI: 10.1109/TCAD.2025.3590663
Bo Zhang;Mingzhe Zhang;Shoumeng Yan
As a key operation in contemporary cryptosystems, modular multiplication (MM) incurs non-negligible latency and area. We first show optimizations of the k-term Karatsuba algorithm for $\lfloor AB/r^{k}\rfloor$ and $AB \bmod r^{k}$, which play a significant role in MM. We prove the bijective mapping between $\lfloor AB/r^{k}\rfloor$ and $AB \bmod r^{k}$, and propose four methods to build efficient Karatsuba multiplications with arbitrary $k$ values. For $k \in [1, 32]$, the multiplication cost for $\lfloor AB/r^{k}\rfloor$ and $AB \bmod r^{k}$ is 25.04% less than that for $AB$ on average. Furthermore, we investigate the correlation between the operand bitwidth $N$ of MM and the decomposition factor $k$ of the Karatsuba algorithm. Karatsuba multiplication with a larger $k$ needs less area in the multiplication phase, but has a more complex implementation in the evaluation and interpolation phases. Experimental results for Barrett MM with $N=32$, 64, 128, 256 and $k=1$, 2, 4, 8 show that MM achieves the minimal area when $N/k=32$. For instance, our proposed design with $N=256$ saves 21.57% in area and 25.71% in area/throughput compared with state-of-the-art designs.
{"title":"Exploration of Karatsuba Algorithm for Efficient Barrett Modular Multiplication","authors":"Bo Zhang;Mingzhe Zhang;Shoumeng Yan","doi":"10.1109/TCAD.2025.3590663","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3590663","url":null,"abstract":"As a key operation in contemporary cryptosystems, modular multiplication (MM) occupies non-negligible latency and area. We first show optimizations of the k-term Karatsuba algorithm for <inline-formula> <tex-math>$ lfloor AB/r^{k} rfloor $ </tex-math></inline-formula> and <inline-formula> <tex-math>$AB text {mod}~r^{k}$ </tex-math></inline-formula> that play a significant role in MM. We prove the bijective mapping between <inline-formula> <tex-math>$ lfloor AB/r^{k} rfloor $ </tex-math></inline-formula> and <inline-formula> <tex-math>$AB text {mod}~r^{k}$ </tex-math></inline-formula>, and propose four methods to build efficient Karatsuba multiplications with arbitrary k values. For <inline-formula> <tex-math>$kin [{1, 32}]$ </tex-math></inline-formula>, the multiplication cost for <inline-formula> <tex-math>$ lfloor AB/r^{k} rfloor $ </tex-math></inline-formula> and <inline-formula> <tex-math>$AB text {mod}~r^{k}$ </tex-math></inline-formula> is 25.04% less than that for AB on average. Furthermore, we investigate the correlation between operand bitwidth N of MM and decomposition factor k of the Karatsuba algorithm. Karatsuba multiplication with a larger k needs less area in the multiplication phase, but also has a more complex implementation in the evaluation and interpolation phases. Experimental results for Barrett MM with <inline-formula> <tex-math>$N=32$ </tex-math></inline-formula>, 64, 128, 256 and <inline-formula> <tex-math>$k=1$ </tex-math></inline-formula>, 2, 4, 8 show that MM achieves the minimal area when <inline-formula> <tex-math>$N/k=32$ </tex-math></inline-formula>. For instance, our proposed design when <inline-formula> <tex-math>$N=256$ </tex-math></inline-formula> saves 21.57% area and 25.71% area/throughput, compared with state-of-art designs.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"45 2","pages":"867-881"},"PeriodicalIF":2.9,"publicationDate":"2025-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146006934","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-07-18 DOI: 10.1109/TCAD.2025.3590658
Zhengfeng Huang;Linya Qiu;Shicheng Yang;Xiaolei Wang;Yingchun Lu;Jun Pan;Fan Cheng;Xiaoqing Wen;Aibin Yan
Integrated circuits are increasingly sensitive to radiation-induced multinode upsets in advanced CMOS technologies. This article proposes a novel low-power quadruple-node-upset (QNU) recovery latch (QNU-CPN), which achieves high reliability through the feedback interconnection of twenty-two input-split C-elements with P-input and N-input (CPNs). Post-layout HSPICE simulation results in a 45-nm CMOS technology show that, compared with four existing QNU recovery latches (LDAVPM, QRHIL, QRHIL-LC, and MURLAV), the proposed QNU-CPN latch reduces power consumption by an average of 56.45%, power-delay product (PDP) by an average of 56.92%, area-PDP (APDP) by an average of 58.59%, and setup time by an average of 11.11%. Furthermore, this article proposes a recovery-rate calculation algorithm that computes the recovery rate from the configuration of multiple fault-tolerant components.
{"title":"QNU-CPN: A Low-Power Single-Event Quadruple-Node-Upset Recovery Latch","authors":"Zhengfeng Huang;Linya Qiu;Shicheng Yang;Xiaolei Wang;Yingchun Lu;Jun Pan;Fan Cheng;Xiaoqing Wen;Aibin Yan","doi":"10.1109/TCAD.2025.3590658","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3590658","url":null,"abstract":"Integrated circuit are increasingly sensitive to radiation-induced multinode upset in advanced cMOS technology. This article proposes a novel low-power quadruple-node-upset (QNU) recovery latch (QNU-CPN), which is based on the feedback interconnection of twenty-two input-split C-element with P-input and N-input (CPNs) to achieve high reliability. Post-layout simulation results for 45-nm cMOS by HSPICE technology show that the proposed QNU-CPN latch exhibits a reduction in power consumption by an average of 56.45%, a reduction in power-delay product (PDP) by an average of 56.92%, a reduction in area-PDP (APDP) by an average of 58.59%, and a reduction in setup time by an average of 11.11%, in comparison to four other existing QNU recovery latch (LDAVPM, QRHIL, QRHIL-LC, MURLAV). Furthermore, this article proposes the recovery rate calculation algorithm method that can calculate the recovery rate based on the configuration of multiple fault-tolerant components.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"45 2","pages":"855-866"},"PeriodicalIF":2.9,"publicationDate":"2025-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146006936","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}