Pub Date : 2025-07-24 DOI: 10.1109/TCAD.2025.3592608
Baoze Zhao;Conghui Luo;Wenjin Huang;Yihua Huang
Zero-knowledge proof (ZKP) has gained widespread application across various domains, demonstrating remarkable success. Among ZKP algorithms, the zero-knowledge succinct noninteractive argument of knowledge (zk-SNARK) is the most widely used. However, despite its advantages of small proof size and succinct verification, zk-SNARK proof generation faces significant challenges due to high computational demands, limiting its practical application. This article addresses these challenges by accelerating two computationally intensive operations in zk-SNARK proof generation, the number theoretic transform (NTT) and multiscalar multiplication (MSM), using FPGAs. In NTT hardware accelerators for zk-SNARK applications, the traditional 4-step algorithm often encounters conflicts between off-chip bandwidth and on-chip memory. To resolve this issue, we propose an innovative approach that enhances accelerator performance by recursively applying the 4-step algorithm to create a more efficient 6-step algorithm. For MSM hardware acceleration on FPGAs, existing works are often constrained by limited on-chip memory, restricting the use of longer slice lengths, which are crucial for higher performance with the commonly used Pippenger algorithm. To overcome this limitation, we introduce the Batch Method, which optimizes off-chip memory consumption, enabling the accelerator to use longer slice lengths and achieve superior performance. Experimental results demonstrate that the proposed NTT design achieves $1.76\times$ higher DSP efficiency than SAM. Meanwhile, the proposed MSM design demonstrates $1.24\times$ higher performance than MSMAC with aligned frequency and number of PEs. When benchmarked against the GPU implementation GZKP, our MSM design exhibits $1.16\times$ and $1.46\times$ higher performance than GZKP for BLS12-381 and BN-254, respectively. However, the NTT design remains at a disadvantage due to the bandwidth gap between our platform, the Xilinx Alveo U250, and GZKP's platforms, the Nvidia GTX 1080 Ti and Nvidia Tesla V100.
{"title":"FPGA-Based Hardware Accelerator of zk-SNARK","authors":"Baoze Zhao;Conghui Luo;Wenjin Huang;Yihua Huang","doi":"10.1109/TCAD.2025.3592608","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3592608","url":null,"abstract":"zero-knowledge proof (ZKP) has gained widespread application across various domains, demonstrating remarkable success. Among ZKP algorithms, zero-knowledge succinct noninteractive argument of knowledge (zk-SNARK) is the most widely used. However, despite its advantages of small proof size and succinct verification, zk-SNARK proof generation faces significant challenges due to high computational demands, limiting its practical application. This article addresses these challenges by accelerating two computationally intensive operations in zk-SNARK proof generation, number theory transformation (NTT) and multiscalar multiplication (MSM), using FPGAs. In the implementation of NTT hardware accelerators for zk-SNARK applications, the traditional 4-step algorithm often encounters conflicts between off-chip bandwidth and on-chip memory. To resolve this issue, we propose an innovative approach that enhances accelerator performance by recursively applying the 4-step algorithm to create a more efficient 6-step algorithm. For MSM hardware acceleration on FPGAs, existing works are often constrained by limited on-chip memory, restricting the use of longer slice lengths, which are crucial for higher performance when using the commenly used Pippenger algorithm. To overcome this limitation, we introduce the Batch Method, optimizing off-chip memory consumption, enabling the accelerator to use longer slice lengths and achieve superior performance. Experimental results demonstrate that the proposed NTT design achieves <inline-formula> <tex-math>$1.76times $ </tex-math></inline-formula> higher DSP efficiency than the SAM. Meanwhile, the proposed MSM design demonstrates <inline-formula> <tex-math>$1.24times $ </tex-math></inline-formula> higher performance than the MSMAC with aligned frequency and number of PEs. When benchmarked against the GPU implementation GZKP, our MSM design exhibits <inline-formula> <tex-math>$1.16times $ </tex-math></inline-formula> and <inline-formula> <tex-math>$1.46times $ </tex-math></inline-formula> higher performance than GZKP for BLS12-381 and BN-254, respectively. However, the NTT design remains at a disadvantage due to the bandwidth limitation between our platform, Xilinx Alveo U250, and GZKP’s platforms, Nvidia GTX 1080 Ti and Nvidia Tesla V100.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"45 2","pages":"887-900"},"PeriodicalIF":2.9,"publicationDate":"2025-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146006874","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-07-24 DOI: 10.1109/TCAD.2025.3592605
Jinyang Li;Samudra Dasgupta;Yuhong Song;Lei Yang;Travis Humble;Weiwen Jiang
Quantum computing has significantly advanced in recent years, boasting devices with hundreds of quantum bits (qubits), hinting at its potential quantum advantage over classical computing. Yet, noise in quantum devices poses significant barriers to realizing this supremacy. Understanding noise's impact is crucial for reproducibility and application reuse; moreover, next-generation quantum-centric supercomputing essentially requires efficient and accurate noise characterization to support system management (e.g., job scheduling), where ensuring correct functional performance (i.e., fidelity) of jobs on available quantum devices can be even higher-priority than traditional objectives. However, noise fluctuates over time, even on the same quantum device, which makes predicting computational bounds under on-the-fly noise vital. Noisy quantum simulation can offer insights but faces efficiency and scalability issues. In this work, we propose a data-driven workflow, namely QuBound, to predict computational performance bounds. It decomposes historical performance traces to isolate noise sources and devises a novel encoder to embed circuit and noise information processed by a long short-term memory (LSTM) network. For evaluation, we compare QuBound with a state-of-the-art learning-based predictor, which only generates a single performance value instead of a bound. Experimental results show that the existing approach's predictions fall outside the performance bounds, while all predictions from QuBound, with the assistance of performance decomposition, fit within them. Moreover, QuBound can efficiently produce practical bounds for various circuits with over $10^{6}\times$ speedup over simulation; in addition, the range from QuBound is over $10\times$ narrower than that of the state-of-the-art analytical approach.
{"title":"Computational Performance Bounds Prediction in Quantum Computing With Unstable Noise","authors":"Jinyang Li;Samudra Dasgupta;Yuhong Song;Lei Yang;Travis Humble;Weiwen Jiang","doi":"10.1109/TCAD.2025.3592605","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3592605","url":null,"abstract":"Quantum computing has significantly advanced in recent years, boasting devices with hundreds of quantum bits (qubits), hinting at its potential quantum advantage over classical computing. Yet, noise in quantum devices poses significant barriers to realizing this supremacy. Understanding noise’s impact is crucial for reproducibility and application reuse; moreover, the next-generation quantum-centric supercomputing essentially requires efficient and accurate noise characterization to support system management (e.g., job scheduling), where ensuring correct functional performance (i.e., fidelity) of jobs on available quantum devices can even be higher-priority than traditional objectives. However, noise fluctuates over time, even on the same quantum device, which makes predicting the computational bounds for on-the-fly noise is vital. Noisy quantum simulation can offer insights but faces efficiency and scalability issues. In this work, we propose a data-driven workflow, namely, QuBound, to predict computational performance bounds. It decomposes historical performance traces to isolate noise sources and devises a novel encoder to embed circuit and noise information processed by a long short-term memory (LSTM) network. For evaluation, we compare QuBound with a state-of-the-art learning-based predictor, which only generates a single performance value instead of a bound. Experimental results show that the result of the existing approach falls outside of performance bounds, while all predictions from our QuBound with the assistance of performance decomposition better fit the bounds. Moreover, QuBound can efficiently produce practical bounds for various circuits with over <inline-formula> <tex-math>$10^{6}$ </tex-math></inline-formula> speedup over simulation; in addition, the range from QuBound is over <inline-formula> <tex-math>$10times $ </tex-math></inline-formula> narrower than the state-of-the-art analytical approach.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"45 2","pages":"969-982"},"PeriodicalIF":2.9,"publicationDate":"2025-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146006937","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-07-24 DOI: 10.1109/TCAD.2025.3592590
Saeed Aghapour;Kasra Ahmadi;Mehran Mozaffari Kermani;Reza Azarderakhsh
Detection of soft errors and faults is one of the most critical factors in ensuring the reliability of algorithm implementations. Multiplication, as a fundamental and computationally intensive operation, is particularly vulnerable to such errors. Given its widespread use in cryptography and coding applications, detecting these errors is crucial. For example, in hash functions, even a single-bit change in the input can completely alter the output (ideally, each bit of the output changes with a probability of $1/2$). Montgomery multiplication, an efficient modular multiplication method, is an integral part of numerous cryptographic applications spanning both classical and post-quantum cryptography. For that reason, this brief introduces a fault detection method for the multiple-precision Montgomery modular multiplication algorithm based on partial recomputation. Through extensive simulations and implementations, we demonstrate that our approach efficiently detects both permanent and transient errors with a high success rate, while imposing modest area and time overhead on the system.
{"title":"Partial Recomputation Fault Detection Architecture for Multiple-Precision Montgomery Modular Multiplication","authors":"Saeed Aghapour;Kasra Ahmadi;Mehran Mozaffari Kermani;Reza Azarderakhsh","doi":"10.1109/TCAD.2025.3592590","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3592590","url":null,"abstract":"Detection of soft errors and faults are one of the most critical factors in ensuring the reliability of algorithm implementations. Multiplication, as a fundamental and computationally intensive operation, is particularly vulnerable to such errors. Given its widespread use in cryptography and coding applications, detecting these errors is crucial. For example, in hash functions, even a single-bit change in the input can completely alter the output ([ideally, each bit of the output changes with a probability of <inline-formula> <tex-math>${}({1}/{2}])$ </tex-math></inline-formula>. Montgomery multiplication as an efficient multiplication method is an integral part of numerous cryptographic applications expanding both classical and post quantum cryptography. For that reason, this brief introduces a fault detection method for the multiple-precision Montgomery modular multiplication algorithm based on partial recomputation. Through extensive simulations and implementations, we demonstrate that our approach efficiently detects both permanent and transient errors with a high-success rate, while imposing modest area and time overhead on the system.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"45 2","pages":"1042-1046"},"PeriodicalIF":2.9,"publicationDate":"2025-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146006859","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-07-24 DOI: 10.1109/TCAD.2025.3592587
Pingdan Xiao;Yiliu Gu;Haoyou Jiang;Zhen Huan;Sichun Du;Qinghui Hong
Model predictive control (MPC), a receding-horizon optimal control strategy, predicts system dynamics and optimizes control actions to satisfy performance and constraint requirements, making it widely adopted in control engineering. However, contemporary computing platforms struggle to meet the real-time and energy-efficiency demands of MPC's computationally intensive matrix operations, owing to high data movement overhead, extensive circuit resource utilization, and frequent data conversions inherent in physical system interfaces. These challenges collectively impose significant latency and power penalties, which become particularly critical as systems grow in complexity and scale in the big-data era. This article introduces a zeroing neural network (ZNN)-based memristive neural network circuit that directly converges the MPC error function to zero in one step. Theoretical analysis and simulations validate the closed-loop circuit's stability. For a 32-step prediction horizon, evaluations show that the control output from the proposed circuit matches the ideal digital MPC solution with 96.0% accuracy. The circuit also executes at least an order of magnitude faster and consumes less energy than traditional MPC solvers. Additionally, the circuit successfully accelerates the proposed trajectory tracking algorithm, achieving 98.0% accuracy relative to the theoretical result and a $318.2\times$ improvement in computation time over a CPU.
{"title":"Memristive Neural Network Circuit Implementation of Model Predictive Control for Trajectory Tracking","authors":"Pingdan Xiao;Yiliu Gu;Haoyou Jiang;Zhen Huan;Sichun Du;Qinghui Hong","doi":"10.1109/TCAD.2025.3592587","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3592587","url":null,"abstract":"model predictive control (MPC), a receding-horizon optimal control strategy, predicts system dynamics and optimizes control actions to satisfy performance and constraint requirements, making it widely adopted in control engineering. However, contemporary computing platforms struggle to meet the real-time and energy-efficient demands of MPC’s computationally intensive matrix operations, stemming from high data movement overhead, extensive circuit resource utilization, and frequent data conversions inherent in physical system interfaces. These challenges collectively impose significant latency and power penalties, particularly critical as systems grow in complexity and scale within the big-data era. This article introduces a zeroing neural network (ZNN)-based memristive neural network circuit that directly converges the MPC error function to zero in one step. Theoretical analysis and simulations validate the closed-loop circuit’s stability. For a 32-step prediction horizon, evaluations show that the control output from the proposed circuit matches the ideal digital MPC solution with 96.0% accuracy. The circuit also executes at least an order of magnitude faster and consumes less energy than traditional MPC solvers. Additionally, the circuit successfully accelerates the proposed trajectory tracking algorithm, achieving 98.0% accuracy compared with the theoretical result and <inline-formula> <tex-math>$318.2times $ </tex-math></inline-formula> improvement in computation time compared to CPU.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"45 2","pages":"955-968"},"PeriodicalIF":2.9,"publicationDate":"2025-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146006931","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-07-22 DOI: 10.1109/TCAD.2025.3591734
Botao Xiong;Shize Zhang;Xingyu Shao;Xintong He;Yuchun Chang
A field-programmable gate array (FPGA)-friendly processing element (PE) based on the small logarithmic floating point (SLFP) format is proposed. The proposed PEs not only support the inner product but also perform various nonlinear activation functions (NAFs); they consume 674 LUT6s and 7 digital signal processing blocks (DSPs) and operate at 450 MHz in a pipelined manner on the Zynq-7000. In addition, as the distribution of SLFP numbers is not uniform, this brief revises the weight decay scheme in the quantization-aware training process to explore the optimum quantized weights. Compared with an INT8-based design, the proposed method balances resource usage between look-up tables and DSPs. The accuracy loss of the quantized model based on 8-bit SLFP is also small due to the high dynamic range of the SLFP format. Moreover, since the proposed method supports different NAFs, this brief improves quantized model accuracy by selecting an appropriate NAF from Swish, GELU, Mish, and PReLU. Compared to the baseline (FP32 parameters, ReLU as the NAF), the accuracy of quantized ResNet-50 and MobileNet changes by +2.65% and −0.33%, respectively.
{"title":"FPGA-Friendly Architecture of Processing Elements for Efficient and Accurate Quantized CNNs","authors":"Botao Xiong;Shize Zhang;Xingyu Shao;Xintong He;Yuchun Chang","doi":"10.1109/TCAD.2025.3591734","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3591734","url":null,"abstract":"An field-programmable gate array (FPGA)-friendly processing element (PE) based on the small logarithmic floating point (SLFP) format is proposed. The proposed PEs not only support inner product but also perform various nonlinear activation functions (NAFs), which consume <inline-formula> <tex-math>$674 times $ </tex-math></inline-formula> LUT6s and <inline-formula> <tex-math>$7 times $ </tex-math></inline-formula> digital signal processing blocks (DSPs) and operate at 450MHz in a pipeline manner for Zynq-7000. In addition, as the distribution of SLFP numbers is not uniform, this brief revises the weight decay scheme in the quantization aware training process to explore the optimum quantized weights. Compared with INT8-based design, the proposed method balances the resource usage between look-up tables and DSPs. The accuracy loss of the quantized model based on the 8-bit SLFP is also small due to the high dynamic range of SLFP format. Moreover, since the proposed method can support different NAFs, this brief improves the quantized model accuracy by selecting an appropriate NAF from Swish, GELU, Mish and PReLU. Compared to the baseline (parameters are FP32, NAF is ReLU), the accuracy of quantized ResNet-50 and MobileNet is increased by 2.65% and −0.33%.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"45 2","pages":"882-886"},"PeriodicalIF":2.9,"publicationDate":"2025-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146006932","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-07-21 DOI: 10.1109/TCAD.2025.3591409
Yifan Qin;Zheyu Yan;Dailin Gan;Jun Xia;Zixuan Pan;Wujie Wen;Xiaobo Sharon Hu;Yiyu Shi
Compute-in-memory accelerators built upon nonvolatile memory devices excel in energy efficiency and latency when performing deep neural network (DNN) inference, thanks to their in-situ data processing capability. However, the stochastic nature and intrinsic variations of nonvolatile memory devices often result in performance degradation during DNN inference. Introducing these nonideal device behaviors in DNN training enhances robustness, but drawbacks include limited accuracy improvement, reduced prediction confidence, and convergence issues. This arises from a mismatch between the deterministic training and nondeterministic device variations, as such training, though considering variations, relies solely on the model’s final output. In this work, inspired by control theory, we propose negative feedback training (NeFT)—a novel concept supported by theoretical analysis—to more effectively capture the multiscale noisy information throughout the network. We instantiate this concept with two specific instances, oriented variational forward (OVF) and intermediate representation snapshot (IRS). Based on device variation models extracted from measured data, extensive experiments show that our NeFT outperforms existing state-of-the-art methods with up to a 45.08% improvement in inference accuracy while reducing epistemic uncertainty, boosting output confidence, and improving convergence probability. These results underline the generality and practicality of our NeFT framework for increasing the robustness of DNNs against device variations. The source code for these two instances is available at https://github.com/YifanQin-ND/NeFT_CIM.
{"title":"NeFT: Negative Feedback Training to Improve Robustness of Compute-in-Memory DNN Accelerators","authors":"Yifan Qin;Zheyu Yan;Dailin Gan;Jun Xia;Zixuan Pan;Wujie Wen;Xiaobo Sharon Hu;Yiyu Shi","doi":"10.1109/TCAD.2025.3591409","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3591409","url":null,"abstract":"Compute-in-memory accelerators built upon nonvolatile memory devices excel in energy efficiency and latency when performing deep neural network (DNN) inference, thanks to their in-situ data processing capability. However, the stochastic nature and intrinsic variations of nonvolatile memory devices often result in performance degradation during DNN inference. Introducing these nonideal device behaviors in DNN training enhances robustness, but drawbacks include limited accuracy improvement, reduced prediction confidence, and convergence issues. This arises from a mismatch between the deterministic training and nondeterministic device variations, as such training, though considering variations, relies solely on the model’s final output. In this work, inspired by control theory, we propose negative feedback training (NeFT)—a novel concept supported by theoretical analysis—to more effectively capture the multiscale noisy information throughout the network. We instantiate this concept with two specific instances, oriented variational forward (OVF) and intermediate representation snapshot (IRS). Based on device variation models extracted from measured data, extensive experiments show that our NeFT outperforms existing state-of-the-art methods with up to a 45.08% improvement in inference accuracy while reducing epistemic uncertainty, boosting output confidence, and improving convergence probability. These results underline the generality and practicality of our NeFT framework for increasing the robustness of DNNs against device variations. The source code for these two instances is available at <uri>https://github.com/YifanQin-ND/NeFT_CIM</uri>.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"45 2","pages":"983-997"},"PeriodicalIF":2.9,"publicationDate":"2025-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146006862","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-07-21 DOI: 10.1109/TCAD.2025.3591411
David Coenen;Herman Oprins
Multiscale thermal analysis of integrated circuits is required to capture both device-level and package-level dynamics. Traditional analysis with the finite element (FE) method performs poorly on multiscale tasks because of conflicting element size requirements and CPU-time limitations. Machine learning (ML) algorithms can be trained with FE simulation data to perform fast and efficient temperature prediction. In this work, spatial and temporal aspects of the temperature field are treated independently and used to train two artificial neural networks (ANNs). Prior to ANN training, fundamental spatial modes are calculated via proper orthogonal decomposition (POD) to simplify the ANN structure. In the time domain, a similar approach is used: the fundamental temporal modes, i.e., thermal step responses, are calculated and used to train the ANN. By training the ANN on step response data, the final dynamic temperature profile can be reconstructed using the convolution operator. Using this method, a physics-informed ML workflow is established, as the step response is converted to the impulse response, or Green's function, which is a known part of the analytical solution to the heat equation. The final result is an extremely fast and accurate dynamic thermal model of a chip.
{"title":"PINDAS: Physics-Informed Decoupled Spatiotemporal Artificial Neural Network for Dynamic Thermal Simulation","authors":"David Coenen;Herman Oprins","doi":"10.1109/TCAD.2025.3591411","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3591411","url":null,"abstract":"Multiscale thermal analysis in integrated circuits is required for capturing both device-level and package-level dynamics. Traditional analysis with the finite element (FE) method performs poor at multiscale tasks because of conflicting element size requirements and CPU time limitation. Machine learning (ML) algorithms can be trained with FE simulation data to perform fast and efficient temperature prediction. In this work, spatial and temporal aspects of the temperature field are treated independently and used to train two artificial neural networks (ANNs). Prior to ANN training, fundamental spatial modes [proper orthogonal decomposition (POD)] are calculated to simplify the ANN structure. In the time domain, a similar approach is used: the fundamental temporal modes, i.e., thermal step responses, are calculated and used to train the ANN. By training the ANN on step response data, the final dynamic temperature profile can be reconstructed using the convolutional operator. Using this method, a physics-informed ML workflow is established as the step response is converted to the impulse response or Green’s function, which are a known part of the analytical solution to the heat equation. The final result is an extremely fast and accurate dynamic thermal model of a chip.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"45 2","pages":"998-1006"},"PeriodicalIF":2.9,"publicationDate":"2025-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146006866","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-07-21 DOI: 10.1109/TCAD.2025.3591410
Zehao Chen;Kai Zhang;Qian Wei;Nan Su;Yuhao Zhang;Zhaoyan Shen;Dongxiao Yu;Lei Ju
Log-structured merge (LSM) tree-based key-value (KV) stores organize writes into hierarchical batches to optimize write performance. However, the notorious compaction process and multilevel query mechanism of the LSM-tree severely hurt system performance. Our preliminary experiments show that 1) when compaction occurs in $L_{0}$ and $L_{1}$ of the LSM-tree, it may saturate system computation and memory resources, ultimately causing the entire system to stall and 2) a large number of iterative retrievals across multiple levels is usually required to locate the queried data, while redundant key range overlap in $L_{0}$ further increases the overhead. Based on these observations, we introduce Re-LSM+, a resistive random-access memory (ReRAM)-based processing-in-memory (PIM) framework for LSM-based KV stores. In Re-LSM+, we offload compaction tasks from the higher levels of the LSM-tree to the PIM processing part. A highly parallel ReRAM compaction accelerator is designed by breaking down the three-phase compaction process into basic logic operations. Additionally, we design an index table and a multilayer Bloom filter for different levels to improve the query efficiency of the LSM-tree. Evaluation results from db_bench show that Re-LSM+ achieves a $2.37\times$ improvement in random write throughput compared to RocksDB. Furthermore, the ReRAM-based compaction accelerator achieves a $68.16\times$ speedup over the CPU-based implementation and reduces energy consumption by $25.5\times$.
{"title":"A ReRAM-Based Processing-in-Memory Framework for LSM-Based Key-Value Store","authors":"Zehao Chen;Kai Zhang;Qian Wei;Nan Su;Yuhao Zhang;Zhaoyan Shen;Dongxiao Yu;Lei Ju","doi":"10.1109/TCAD.2025.3591410","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3591410","url":null,"abstract":"log-structured merge (LSM) tree-based key-value (KV) stores organize writes into hierarchical batches to optimize write performance. However, the notorious compaction process and multilevel query mechanism of LSM-tree severely hurt system performance. Our preliminary experiments show that 1) When compaction occurs in the <inline-formula> <tex-math>$L_{0}$ </tex-math></inline-formula> and <inline-formula> <tex-math>$L_{1}$ </tex-math></inline-formula> of the LSM-tree, it may saturate system computation and memory resources, ultimately causing the entire system to stall and 2) large number of iterative retrievals across multiple levels is usually required to locate the queried data, while redundant key range overlap in <inline-formula> <tex-math>$L_{0}$ </tex-math></inline-formula> further increases the overhead. Based on these observations, we introduce Re-LSM+, a resistive random-access memory (ReRAM)-based Processing-in-Memory framework for LSM-based KV Stores. In Re-LSM+, we offload compaction tasks from the higher levels of the LSM-tree to the PIM processing part. A highly parallel ReRAM compaction accelerator is designed by breaking down the three-phase compaction process into basic logic operations. Additionally, we design an index table and a multilayer Bloom filter for different levels to improve the query efficiency of the LSM-tree. Evaluation results from db_bench show that Re-LSM+ achieves a <inline-formula> <tex-math>$2.37times $ </tex-math></inline-formula> improvement in random write throughput compared to RocksDB. Furthermore, the ReRAM-based compaction accelerator achieves a <inline-formula> <tex-math>$68.16times $ </tex-math></inline-formula> speedup over the CPU-based implementation and reduces energy consumption to <inline-formula> <tex-math>$25.5times $ </tex-math></inline-formula>.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"45 2","pages":"1061-1074"},"PeriodicalIF":2.9,"publicationDate":"2025-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146006883","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-07-18 DOI: 10.1109/TCAD.2025.3590663
Bo Zhang;Mingzhe Zhang;Shoumeng Yan
As a key operation in contemporary cryptosystems, modular multiplication (MM) incurs non-negligible latency and area. We first show optimizations of the k-term Karatsuba algorithm for $\lfloor AB/r^{k}\rfloor$ and $AB \bmod r^{k}$, which play a significant role in MM. We prove the bijective mapping between $\lfloor AB/r^{k}\rfloor$ and $AB \bmod r^{k}$, and propose four methods to build efficient Karatsuba multiplications with arbitrary $k$ values. For $k \in [1, 32]$, the multiplication cost for $\lfloor AB/r^{k}\rfloor$ and $AB \bmod r^{k}$ is 25.04% less than that for $AB$ on average. Furthermore, we investigate the correlation between the operand bitwidth $N$ of MM and the decomposition factor $k$ of the Karatsuba algorithm. Karatsuba multiplication with a larger $k$ needs less area in the multiplication phase, but has a more complex implementation in the evaluation and interpolation phases. Experimental results for Barrett MM with $N=32$, 64, 128, 256 and $k=1$, 2, 4, 8 show that MM achieves the minimal area when $N/k=32$. For instance, our proposed design with $N=256$ saves 21.57% in area and 25.71% in area/throughput compared with state-of-the-art designs.
{"title":"Exploration of Karatsuba Algorithm for Efficient Barrett Modular Multiplication","authors":"Bo Zhang;Mingzhe Zhang;Shoumeng Yan","doi":"10.1109/TCAD.2025.3590663","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3590663","url":null,"abstract":"As a key operation in contemporary cryptosystems, modular multiplication (MM) occupies non-negligible latency and area. We first show optimizations of the k-term Karatsuba algorithm for <inline-formula> <tex-math>$ lfloor AB/r^{k} rfloor $ </tex-math></inline-formula> and <inline-formula> <tex-math>$AB text {mod}~r^{k}$ </tex-math></inline-formula> that play a significant role in MM. We prove the bijective mapping between <inline-formula> <tex-math>$ lfloor AB/r^{k} rfloor $ </tex-math></inline-formula> and <inline-formula> <tex-math>$AB text {mod}~r^{k}$ </tex-math></inline-formula>, and propose four methods to build efficient Karatsuba multiplications with arbitrary k values. For <inline-formula> <tex-math>$kin [{1, 32}]$ </tex-math></inline-formula>, the multiplication cost for <inline-formula> <tex-math>$ lfloor AB/r^{k} rfloor $ </tex-math></inline-formula> and <inline-formula> <tex-math>$AB text {mod}~r^{k}$ </tex-math></inline-formula> is 25.04% less than that for AB on average. Furthermore, we investigate the correlation between operand bitwidth N of MM and decomposition factor k of the Karatsuba algorithm. Karatsuba multiplication with a larger k needs less area in the multiplication phase, but also has a more complex implementation in the evaluation and interpolation phases. Experimental results for Barrett MM with <inline-formula> <tex-math>$N=32$ </tex-math></inline-formula>, 64, 128, 256 and <inline-formula> <tex-math>$k=1$ </tex-math></inline-formula>, 2, 4, 8 show that MM achieves the minimal area when <inline-formula> <tex-math>$N/k=32$ </tex-math></inline-formula>. For instance, our proposed design when <inline-formula> <tex-math>$N=256$ </tex-math></inline-formula> saves 21.57% area and 25.71% area/throughput, compared with state-of-art designs.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"45 2","pages":"867-881"},"PeriodicalIF":2.9,"publicationDate":"2025-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146006934","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-07-18 DOI: 10.1109/TCAD.2025.3590658
Zhengfeng Huang;Linya Qiu;Shicheng Yang;Xiaolei Wang;Yingchun Lu;Jun Pan;Fan Cheng;Xiaoqing Wen;Aibin Yan
Integrated circuits are increasingly sensitive to radiation-induced multinode upsets in advanced CMOS technologies. This article proposes a novel low-power quadruple-node-upset (QNU) recovery latch (QNU-CPN), which achieves high reliability through the feedback interconnection of twenty-two input-split C-elements with P-input and N-input (CPNs). Post-layout HSPICE simulation results in a 45-nm CMOS technology show that, compared with four existing QNU recovery latches (LDAVPM, QRHIL, QRHIL-LC, and MURLAV), the proposed QNU-CPN latch reduces power consumption by an average of 56.45%, power-delay product (PDP) by an average of 56.92%, area-PDP (APDP) by an average of 58.59%, and setup time by an average of 11.11%. Furthermore, this article proposes a recovery-rate calculation algorithm that computes the recovery rate from the configuration of multiple fault-tolerant components.
{"title":"QNU-CPN: A Low-Power Single-Event Quadruple-Node-Upset Recovery Latch","authors":"Zhengfeng Huang;Linya Qiu;Shicheng Yang;Xiaolei Wang;Yingchun Lu;Jun Pan;Fan Cheng;Xiaoqing Wen;Aibin Yan","doi":"10.1109/TCAD.2025.3590658","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3590658","url":null,"abstract":"Integrated circuit are increasingly sensitive to radiation-induced multinode upset in advanced cMOS technology. This article proposes a novel low-power quadruple-node-upset (QNU) recovery latch (QNU-CPN), which is based on the feedback interconnection of twenty-two input-split C-element with P-input and N-input (CPNs) to achieve high reliability. Post-layout simulation results for 45-nm cMOS by HSPICE technology show that the proposed QNU-CPN latch exhibits a reduction in power consumption by an average of 56.45%, a reduction in power-delay product (PDP) by an average of 56.92%, a reduction in area-PDP (APDP) by an average of 58.59%, and a reduction in setup time by an average of 11.11%, in comparison to four other existing QNU recovery latch (LDAVPM, QRHIL, QRHIL-LC, MURLAV). Furthermore, this article proposes the recovery rate calculation algorithm method that can calculate the recovery rate based on the configuration of multiple fault-tolerant components.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"45 2","pages":"855-866"},"PeriodicalIF":2.9,"publicationDate":"2025-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146006936","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}