This article introduces OpenLS-DGF, an adaptive logic synthesis dataset generation framework, to enhance machine-learning (ML) applications within the logic synthesis process. Previous dataset generation flows were tailored to specific tasks or lacked integrated ML capabilities. In contrast, OpenLS-DGF supports various ML tasks by encapsulating the three fundamental steps of logic synthesis: 1) Boolean representation; 2) logic optimization; and 3) technology mapping. It preserves the original circuit information in both Verilog and ML-friendly GraphML formats. The Verilog files offer semi-customizable capabilities, enabling researchers to insert additional steps and incrementally refine the generated dataset. Furthermore, OpenLS-DGF includes an adaptive circuit engine that facilitates final dataset management and downstream tasks. The generated OpenLS-D-v1 dataset comprises 46 combinational designs from established benchmarks, totaling over 966,000 Boolean circuits. OpenLS-D-v1 supports the integration of new data features, making it versatile for new tasks. This article demonstrates the versatility of OpenLS-D-v1 through four distinct downstream tasks: circuit classification, circuit ranking, quality-of-results (QoR) prediction, and probability prediction. Each task represents an essential step of logic synthesis, and the experimental results show that the dataset generated by OpenLS-DGF achieves notable diversity and applicability. The source code and datasets are available at https://github.com/Logic-Factory/ACE/blob/master/OpenLS-DGF.
{"title":"OpenLS-DGF: An Adaptive Open-Source Dataset Generation Framework for Machine-Learning Tasks in Logic Synthesis","authors":"Liwei Ni;Rui Wang;Miao Liu;Xingyu Meng;Xiaoze Lin;Junfeng Liu;Guojie Luo;Zhufei Chu;Weikang Qian;Xiaoyan Yang;Biwei Xie;Xingquan Li;Huawei Li","doi":"10.1109/TCAD.2025.3555506","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3555506","url":null,"abstract":"This article introduces OpenLS-DGF, an adaptive logic synthesis dataset generation framework, to enhance machine-learning (ML) applications within the logic synthesis process. Previous dataset generation flows were tailored for specific tasks or lacked integrated ML capabilities. While OpenLS-DGF supports various ML tasks by encapsulating the three fundamental steps of logic synthesis: 1) Boolean representation; 2) logic optimization; and 3) technology mapping. It preserves the original information in both Verilog and ML-friendly GraphML formats. The Verilog files offer semi-customizable capabilities, enabling researchers to insert additional steps and incrementally refine the generated dataset. Furthermore, OpenLS-DGF includes an adaptive circuit engine that facilitates the final dataset management and downstream tasks. The generated OpenLS-D-v1 dataset comprises 46 combinational designs from established benchmarks, totaling over 966 000 Boolean circuits. OpenLS-D-v1 supports integrating new data features, making it more versatile for new tasks. This article demonstrates the versatility of OpenLS-D-v1 through four distinct downstream tasks: circuit classification, circuit ranking, quality of results (QoR) prediction, and probability prediction. Each task is chosen to represent essential steps of logic synthesis, and the experimental results show the generated dataset from OpenLS-DGF achieves prominent diversity and applicability. The source code and datasets are available at <uri>https://github.com/Logic-Factory/ACE/blob/master/OpenLS-DGF</uri>.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 10","pages":"3830-3843"},"PeriodicalIF":2.9,"publicationDate":"2025-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145090249","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-03-27 | DOI: 10.1109/TCAD.2025.3555513
Yang Wang;Hanlong Chen;Wang Lin;Zuohua Ding
Barrier certificate generation is an ingenious and powerful approach for the safety verification of cyber-physical systems. This article proposes a new learning-and-verification framework that balances representation ability against verification efficiency for neural barrier certificates. In the learning phase, it learns candidate barrier certificates represented as convex difference neural networks (CDiNNs). Because CDiNNs can be rewritten as difference-of-convex (DC) functions, which can express any twice-differentiable function, they offer outstanding representation ability and flexibility. In the verification phase, the framework formally verifies the validity of the neural candidates via an efficient DC-programming approach. Owing to their convexity-based structure, CDiNNs significantly facilitate the verification process. We conduct an experimental evaluation over a set of benchmarks, which validates that our method is considerably more efficient and effective than state-of-the-art approaches.
{"title":"Formal Synthesis of Neural Barrier Certificates for Dynamical Systems via DC Programming","authors":"Yang Wang;Hanlong Chen;Wang Lin;Zuohua Ding","doi":"10.1109/TCAD.2025.3555513","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3555513","url":null,"abstract":"Barrier certificate generation is an ingenious and powerful approach for safety verification of cyber-physical systems. This article suggests a new learning and verification framework that helps to achieve the balance between the representation ability and the verification efficiency for neural barrier certificates. In the learning phase, it learns candidate barrier certificates represented as convex difference neural networks (CDiNNs). Since CDiNNs can be rewritten as difference of convex (DC) functions that can express any twice differentiable function, thus have outstanding representation ability and flexibility. In the verification phase, it employs an efficient approach for formally verifying the validity of the neural candidates via DC programming. Due to the convexity-based structure, CDiNNs can significantly facilitate the verification process. We conduct an experimental evaluation over a set of benchmarks, which validates that our method is much more efficient and effective than the state-of-the-art approaches.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 10","pages":"4038-4042"},"PeriodicalIF":2.9,"publicationDate":"2025-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145100356","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-03-26 | DOI: 10.1109/TCAD.2025.3573685
Liang Xiao;Shiju Lin;Jinwei Liu;Qinkai Duan;Tsung-Yi Ho;Evangeline F. Y. Young
Global routing plays a crucial role in electronic design automation (EDA), serving not only as a routing optimizer but also as a tool for estimating routability in earlier stages such as logic synthesis and physical planning. However, these scenarios often require global routing on unpartitioned large designs, posing unique scalability challenges in both runtime and design size. To tackle this issue, this article introduces techniques for parallelizing large-scale global routing that significantly increase parallelism and thus reduce runtime. We also propose a new flexible layer transition technique to increase the flexibility and routing quality of directed acyclic graph (DAG) routing. Building upon these techniques, we have developed an open-source GPU-based global router that achieves state-of-the-art results on the latest ISPD'24 Contest benchmarks, showcasing the effectiveness of our methods.
{"title":"InstantGR: Scalable GPU Parallelization for 3-D Global Routing","authors":"Liang Xiao;Shiju Lin;Jinwei Liu;Qinkai Duan;Tsung-Yi Ho;Evangeline F. Y. Young","doi":"10.1109/TCAD.2025.3573685","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3573685","url":null,"abstract":"Global routing plays a crucial role in electronic design automation (EDA), serving not only as a means of optimizing routing but also as a tool for estimating routability in earlier stages, such as logic synthesis and physical planning. However, these scenarios often require global routing on unpartitioned large designs, posing unique challenges in scalability, both in terms of runtime and design size. To tackle this issue, this article introduces useful techniques for parallelizing large-scale global routing that can significantly increase parallelism and thus reduce runtime. We also propose a new flexible layer transition technique to increase the flexibility and routing quality of directed acyclic graph (DAG) routing. Building upon these techniques, we have developed an open-source GPU-based global router that achieves state-of-the-art results in the latest ISPD’24 Contest benchmarks, thereby showcasing the effectiveness of our methods.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"45 1","pages":"441-452"},"PeriodicalIF":2.9,"publicationDate":"2025-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11015529","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145904326","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fully homomorphic encryption (FHE) enables high-level security but carries a heavy computation workload, necessitating software-hardware co-design for aggressive acceleration. Recent works on specialized accelerators for HE evaluation have made significant progress in supporting lightweight RNS-CKKS applications, especially those with high-density in-memory computing techniques. To fulfill the higher computational demands of more general applications, this article proposes the multicluster HE accelerating system (MCHEAS), comprising multiple in-situ HE processing accelerators, each functioning as a cluster, that perform large-parameter RNS-CKKS evaluation collaboratively. MCHEAS features optimization strategies including synchronous swap, preemptive swap, square-diagonal, and odd-even index separation. Using these strategies to compile the computation and transmission of number theoretic transform (NTT) coefficients, the method optimizes intercluster data swaps, a major bottleneck in NTT computations. Evaluations show that at 1 GHz, across different intercluster data transfer bandwidths, our approach accelerates NTT computations by 26.40% to 51.75%. MCHEAS also improves computing-unit utilization by 10.30% to 33.97%, with a peak utilization rate of up to 99.62%. MCHEAS achieves 17.63% to 34.67% speedups for HE operations involving NTT, and 15.12% to 30.62% speedups for the demonstrated applications, while enhancing computing-unit utilization by 5.18% to 21.87% during application execution. Furthermore, we compare MCHEAS with state-of-the-art designs under a specific intercluster data transfer bandwidth, achieving up to $81.45\times$ their area efficiency in applications.
{"title":"MCHEAS: Optimizing Large-Parameter NTT Over Multicluster In-Situ FHE Accelerating System","authors":"Zhenyu Guan;Yongqing Zhu;Luchang Lei;Hongyang Jia;Yi Chen;Bo Zhang;Changrui Ren;Jin Dong;Song Bian","doi":"10.1109/TCAD.2025.3555191","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3555191","url":null,"abstract":"Fully Homomorphic encryption (FHE) enables high-level security but with a heavy computation workload, necessitating software-hardware co-design for aggressive acceleration. Recent works on specialized accelerators for HE evaluation have made significant progress in supporting lightweight RNS-CKKS applications, especially those with high-density in-memory computing techniques. To fulfill higher computational demands for more general applications, this article proposes multicluster HE accelerating system (MCHEAS), an accelerating system comprising multiple in-situ HE processing accelerators, each functioning as a cluster to perform large-parameter RNS-CKKS evaluation collaboratively. MCHEAS features optimization strategies including the synchronous, preemptive swap, square-diagonal, and odd-even index separation. Using these strategies to compile the computation and transmission of number theoretic transform (NTT) coefficients, the method optimizes the intercluster data swaps, a major bottleneck in NTT computations. Evaluations show that under 1 GHz, with different intercluster data transfer bandwidths, our approach accelerates NTT computations by 26.40% to 51.75%. MCHEAS also improves computing unit utilization by 10.30% to 33.97%, with a maximum peak utilization rate of up to 99.62%. MCHEAS achieves 17.63% to 34.67% speedups for HE operations involving NTT, and 15.12% to 30.62% speedups for demonstrated applications, while enhancing the computing units’ utilization by 5.18% to 21.87% during application execution. Furthermore, we compare MCHEAS with SOTA designs under a specific intercluster data transfer bandwidth, achieving up to <inline-formula> <tex-math>$81.45times $ </tex-math></inline-formula> their area efficiencies in applications.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 10","pages":"3683-3696"},"PeriodicalIF":2.9,"publicationDate":"2025-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145090248","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-03-26 | DOI: 10.1109/TCAD.2025.3555192
Haishuang Fan;Rui Meng;Qichu Sun;Jingya Wu;Wenyan Lu;Xiaowei Li;Guihai Yan
Graphs play an important role in various applications. With the rapid growth of vertex counts in real-world graphs, existing large-scale graph processing frameworks on CPUs and GPUs struggle to optimize cache usage due to irregular memory access patterns. To address this, graph reordering has been proposed to improve graph locality, but it introduces significant overhead without delivering substantial end-to-end performance improvement. While there are many FPGA-based graph processing accelerators, achieving high throughput often requires complex graph preprocessing on CPUs. Therefore, implementing an efficient end-to-end graph processing system remains challenging. This article introduces GRACE, an end-to-end FPGA-based graph processing accelerator with a graph reordering engine and a pull-based vertex-centric programming model (PL-VCPM) engine. First, GRACE employs a customized high-degree vertex cache (HDC) to improve memory access efficiency. Second, GRACE offloads graph preprocessing to the FPGA, using a customized, efficient graph reordering engine. Third, GRACE adopts a graph pruning strategy to remove activation and computation redundancy in graph processing. Finally, GRACE introduces a graph conflict board (GCB) to resolve data conflicts and a multiport cache to enhance parallel efficiency. Experimental results demonstrate that GRACE achieves a $7.1\times$ end-to-end speedup over CPU and $1.8\times$ over GPU, as well as $27.3\times$ and $8.7\times$ better energy efficiency than CPU and GPU, respectively. Moreover, GRACE delivers up to a $34.9\times$ speedup compared to the state-of-the-art FPGA accelerator.
{"title":"GRACE: An End-to-End Graph Processing Accelerator on FPGA With Graph Reordering Engine","authors":"Haishuang Fan;Rui Meng;Qichu Sun;Jingya Wu;Wenyan Lu;Xiaowei Li;Guihai Yan","doi":"10.1109/TCAD.2025.3555192","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3555192","url":null,"abstract":"Graphs play an important role in various applications. With the rapid expansion of vertices in real life, existing large-scale graph processing frameworks on CPUs and GPUs encounter challenges in optimizing cache usage due to irregular memory access patterns. To address this, graph reordering has been proposed to improve the locality of the graph, but introduces significant overhead without delivering substantial end-to-end performance improvement. While there have been many FPGA-based accelerators for graph processing, achieving high throughput often requires complex graph prepossessing on CPUs. Therefore, implementing an efficient end-to-end graph processing system remains challenging. This article introduces GRACE, an end-to-end FPGA-based graph processing accelerator with a graph reordering engine and a pull-based vertex-centric programming model (PL-VCPM) Engine. First, GRACE employs a customized high-degree vertex cache (HDC) to improve memory access efficiency. Second, GRACE offloads the graph preprocessing to FPGA. We customize an efficient graph reordering engine to complete preprocessing. Third, GRACE adopts a graph pruning strategy to remove the activation and computation redundancy in graph processing. Finally, GRACE introduces a graph conflict board (GCB) to resolve data conflicts and a multiport cache to enhance parallel efficiency. Experimental results demonstrate that GRACE achieves <inline-formula> <tex-math>$7.1 times $ </tex-math></inline-formula> end-to-end performance speedup over CPU and <inline-formula> <tex-math>$1.8 times $ </tex-math></inline-formula> over GPU, as well as <inline-formula> <tex-math>$27.3 times $ </tex-math></inline-formula> and <inline-formula> <tex-math>$8.7 times $ </tex-math></inline-formula> energy efficiency over CPU and GPU. Moreover, GRACE delivers up to <inline-formula> <tex-math>$34.9 times $ </tex-math></inline-formula> performance speedup compared to the state-of-the-art FPGA accelerator.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 10","pages":"3816-3829"},"PeriodicalIF":2.9,"publicationDate":"2025-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145090261","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Different DNNs can exhibit opposite arithmetic intensities (ArIs): some operators are compute-bound while others are memory-bound. This challenges the generality of single acceleration architectures, including both dedicated on-chip processing (OCP) and near-data processing (NDP): neither can simultaneously achieve optimal energy efficiency and performance for operators with opposite ArI. A natural idea is to combine the respective advantages of OCP and NDP. However, few publications have addressed their real-time co-optimization, primarily due to the lack of a quantifiable offloading method. Here, we propose GPOS, a general and precise offloading strategy that supports highly general DNN acceleration. GPOS comprehensively considers the complex interactions between OCP and NDP, including hardware configurations, dataflow (DF), the DNN model, and interdie data movements (DMs). Three quantifiable indicators (ArI, execution cost (Ex-cost), and DM-cost) are employed to precisely evaluate the impacts of these interactions on energy and latency. GPOS adopts a four-step flow with progressive refinement: each of the first three steps focuses on a single indicator at the operator level, while the final step performs context-based calibration to address operator interdependencies and avoid offsetting NDP benefits. Narrowing down the offloading candidates in steps 1 and 3 significantly accelerates real-time quantitative analysis. Optimized mapping techniques and an NDP-input-stationary DF are proposed to reduce Ex-cost and extend the operator types supported by NDP. In addition, for the first time, sparsity, one of the most popular energy-optimization methods and one that can alter data reuse and ArI, is quantitatively investigated for its impact on offloading using GPOS. Our evaluations cover representative DNNs, including GPT-2, BERT, RNN, CNN, and MLP models. GPOS achieves the minimum energy and latency for each benchmark, with geometric-mean speedups of 49.0% and 94.1% and geometric-mean energy savings of 45.8% and 89.2% over All-OCP and All-NDP, respectively. GPOS also reduces offloading analysis latency by a geometric mean of 92.7% compared to an evaluation that traverses each operator and its relative combinations. On average, sparsity further improves performance and energy efficiency by increasing the number of operators offloaded to NDP. However, for DNNs where all operators exhibit either very high or very low ArI, the number of offloaded operators remains unchanged even after sparsity is applied.
{"title":"GPOS: A General and Precise Offloading Strategy for High Generality of DNN Acceleration by OCP and NDP Co-Optimizing","authors":"Zixu Li;Wang Wang;Manni Li;Jiayu Yang;Zijian Huang;Xin Zhong;Yinyin Lin;Chengchen Wang;Xiankui Xiong","doi":"10.1109/TCAD.2025.3555184","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3555184","url":null,"abstract":"The arithmetic intensity (ArI) of different DNNs can be opposite. This challenges the generality of single acceleration architectures, including both dedicated on-chip processing (OCP) and near-data processing (NDP). Neither architecture can simultaneously achieve optimal energy efficiency and performance for operators with opposite ArI. It is relatively straightforward to think of combining the respective advantages of OCP and NDP. However, few publications have addressed their real-time co-optimization, primarily due to the lack of a quantifiable offloading method. Here, we propose GPOS, a general and precise offloading strategy that supports high generality of DNN acceleration. GPOS comprehensively considers the complex interactions between OCP and NDP, including hardware configurations, dataflow (DF), DNN model, and interdie data movements (DMs). Three quantifiable indicators—ArI, execution cost (Ex-cost), and DM-cost—are employed to precisely evaluate the impacts of these interactions on energy and latency. GPOS adopts a four-step flow with progressive refinement: each of the first three steps focuses on a single indicator at the operator level, while the final step performs context-based calibration to address operator interdependencies and avoid offsetting NDP benefits. Narrowing down offloading candidates in step 1 and step 3 significantly accelerates real-time quantitative analysis. Optimized mapping techniques and NDP-input stationary DF are proposed to reduce Ex-cost and extend operator types supported by NDP. Next, for the first time, sparsity—one of the most popular methods for energy optimization that can alter data reuse or ArI—is quantitatively investigated for its impacts on offloading using GPOS. Our evaluations include representative DNNs, including GPT-2, Bert, RNN, CNN, and MLP. GPOS achieves the minimum energy and latency for each benchmark, with geometric mean speedups of 49.0% and 94.1%, and geometric mean energy savings of 45.8% and 89.2% over All-OCP and All-NDP, respectively. GPOS also reduces offloading analysis latency by a geometric mean of 92.7% compared to the evaluation that traverses each operator and its relative combinations. On average, sparsity further improves performance and energy efficiency by increasing the number of operators offloaded to NDP. However, for DNNs where all operators exhibit either very high or very low ArI, the number of offloaded operators remains unchanged, even after sparsity is applied.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 10","pages":"3776-3789"},"PeriodicalIF":2.9,"publicationDate":"2025-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145090055","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-03-25 | DOI: 10.1109/TCAD.2025.3554612
Alexandra Küster;Rainer Dorsch;Christian Haubelt
The development of modern heterogeneous systems requires early integration of the various domains to improve and verify the design. Heterogeneous virtual prototypes are a key enabler for reaching this goal, and their high simulation speed is of utmost importance for efficiently supporting development. This article introduces measures to speed up SystemC analog/mixed-signal (AMS) simulations, which are commonly used to simulate the AMS part jointly with the digital prototype in SystemC. Two approaches for integrating variable-step ordinary differential equation (ODE) solvers into the simulation semantics of SystemC AMS are presented. Both avoid global backtracking; one is well suited for feedback loops, and the other is favorable for systems that react dynamically to events. Moreover, a timestep quantization is developed that overcomes the recurrent matrix-inversion bottleneck of variable-step implicit solvers. A similar method is then used to increase the simulation speed of electrical linear network models with high switching activity. Various experiments from the context of smart sensors demonstrate the effectiveness of these measures in enhancing simulation speed.
{"title":"Toward Fast Heterogeneous Virtual Prototypes: Increasing the Solver Efficiency in SystemC AMS","authors":"Alexandra K端ster;Rainer Dorsch;Christian Haubelt","doi":"10.1109/TCAD.2025.3554612","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3554612","url":null,"abstract":"The development of modern heterogeneous systems requires early integration of the various domains to improve and verify the design. Heterogeneous virtual prototypes are a key enabler to reach this goal. In order to efficiently support the development, their high simulation speed is of utmost importance. This article introduces measures to speed-up SystemC analog/mixed-signal (AMS) simulations which are commonly used to simulate the AMS part jointly with the digital prototype in SystemC. Two approaches to integrate variable-step ordinary differential equation solvers into the simulation semantics of SystemC AMS are presented. Both of them avoid global backtracking. One is well suited for feedback loops and the other is favorable for systems dynamically reacting onto events. Moreover, a timestep quantization is developed that overcomes the recurrent matrix inversion bottleneck of variable-step implicit solvers. A similar method is then used to increase the simulation speed of electrical linear network models with high switching activity. Various experiments from the context of smart sensors are presented which prove the effectiveness for enhancing the simulation speed.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 10","pages":"3868-3881"},"PeriodicalIF":2.9,"publicationDate":"2025-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145090057","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-03-24 | DOI: 10.1109/TCAD.2025.3554144
Jianhua Gao;Zhi Zhou;Xingze Huang;Juan Wang;Yizhuo Wang;Weixing Ji
The CPU-FPGA heterogeneous computing architecture is extensively employed in the embedded domain due to its low cost and power efficiency, with numerous sparse matrix-vector multiplication (SpMV) acceleration efforts already targeting this architecture. However, existing work rarely includes collaborative SpMV computations between CPU and FPGA, which limits the exploration of hybrid architectures that could potentially offer enhanced performance and flexibility. This article introduces an FPGA architecture design that supports multiprecision SpMV computations, including FP16, FP32, and FP64. Building on this, PTPS, a precision-aware SpMV task partitioning and dynamic scheduling algorithm tailored for the CPU-FPGA heterogeneous architecture, is proposed. The core idea of PTPS is lossless partitioning of sparse matrices across multiple precisions, prioritizing low-precision SpMV computations on the FPGA and high-precision computations on the CPU. PTPS not only leverages the strengths of CPU and FPGA for collaborative SpMV computations but also reduces data transmission overhead between them, thereby improving the overall computational efficiency. Experimental evaluation demonstrates that the proposed approach offers an average speedup of $1.57\times$ over the CPU-only approach and $2.58\times$ over the FPGA-only approach.
{"title":"PTPS: Precision-Aware Task Partitioning and Scheduling for SpMV on CPU-FPGA Heterogeneous Platforms","authors":"Jianhua Gao;Zhi Zhou;Xingze Huang;Juan Wang;Yizhuo Wang;Weixing Ji","doi":"10.1109/TCAD.2025.3554144","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3554144","url":null,"abstract":"The CPU-FPGA heterogeneous computing architecture is extensively employed in the embedded domain due to its low cost and power efficiency, with numerous sparse matrix-vector multiplication (SpMV) acceleration efforts already targeting this architecture. However, existing work rarely includes collaborative SpMV computations between CPU and FPGA, which limits the exploration of hybrid architectures that could potentially offer enhanced performance and flexibility. This article introduces an FPGA architecture design that supports multiprecision SpMV computations, including FP16, FP32, and FP64. Building on this, PTPS, a precision-aware SpMV task partitioning and dynamic scheduling algorithm tailored for the CPU-FPGA heterogeneous architecture, is proposed. The core idea of PTPS is lossless partitioning of sparse matrices across multiple precisions, prioritizing low-precision SpMV computations on the FPGA and high-precision computations on the CPU. PTPS not only leverages the strengths of CPU and FPGA for collaborative SpMV computations but also reduces data transmission overhead between them, thereby improving the overall computational efficiency. Experimental evaluation demonstrates that the proposed approach offers an average speedup of <inline-formula> <tex-math>$1.57times $ </tex-math></inline-formula> over the CPU-only approach and <inline-formula> <tex-math>$2.58times $ </tex-math></inline-formula> over the FPGA-only approach.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 10","pages":"3804-3815"},"PeriodicalIF":2.9,"publicationDate":"2025-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145090052","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-03-23 | DOI: 10.1109/TCAD.2025.3573223
Irith Pomeranz
Ensuring correct functional operation of a chip requires extensive testing. Free of the constraints of maintaining functional operation conditions, structural (scan-based) tests allow high fault coverage to be achieved efficiently. To cover defects that are exhibited only under functional operation conditions, functional test sequences are used to complement scan-based tests. One limitation of functional test sequences is their length, making test compaction important. To avoid losing the functional properties of a sequence when test compaction is applied at the gate level, design-for-testability (DFT) logic can be used to keep the circuit in its functional state space. In this context, this article suggests the new concept of a modular functional test sequence consisting of subsequences that can be plugged in or out to increase the fault coverage or reduce the sequence length. To support modularity at the gate level, DFT logic is used for restoring functional states between subsequences. Modularity offers the key advantage that a single compact functional test sequence can be constructed from a given pool of functional test sequences, and the modular sequence can be updated as additional sequences become available in the pool or additional fault models are targeted. The article develops a procedure for generating and compacting modular sequences using subsequences from a given pool, and presents experimental results for benchmark circuits in an academic simulation environment to demonstrate its effectiveness and limitations.
{"title":"Modular Functional Test Sequences for Test Compaction","authors":"Irith Pomeranz","doi":"10.1109/TCAD.2025.3573223","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3573223","url":null,"abstract":"Ensuring correct functional operation of a chip requires extensive testing. Without the constraints of maintaining functional operation conditions, structural (scan-based) tests allow high-fault coverage to be achieved efficiently. To cover defects that are only exhibited under functional operation conditions, functional test sequences are used for complementing scan-based tests. One of the limitations of functional test sequences is their length, making it important to apply test compaction. To avoid losing the functional properties of a sequence when test compaction is applied at the gate level, design-for-testability (DFT) logic can be used for keeping the circuit in its functional state space. In this context, this article suggests the new concept of a modular functional test sequence consisting of subsequences that can be plugged in or out to increase the fault coverage or reduce the sequence length. To support modularity at the gate level, DFT logic is used for restoring functional states between subsequences. Modularity offers the key advantage that a single compact functional test sequence can be constructed from a given pool of functional test sequences, and the modular sequence can be updated as additional sequences become available in the pool, or additional fault models are targeted. The article develops a procedure for the generation and compaction of modular sequences using subsequences from a given pool, and presents experimental results for benchmark circuits in an academic simulation environment to demonstrate its effectiveness and limitations.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"45 1","pages":"407-417"},"PeriodicalIF":2.9,"publicationDate":"2025-03-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145904319","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-03-22 | DOI: 10.1109/TCAD.2025.3572838
Yuan Zhang;Kuncai Zhong;Jiliang Zhang
As a vital security primitive, the true random number generator (TRNG) is a mandatory component for building the root of trust of any encryption system. However, existing TRNGs suffer from low throughput and high area and energy consumption. Additionally, electronic design automation (EDA) for TRNGs targeting specific applications remains an unexplored area. To address these issues, we propose compact, high-throughput TRNGs based on dynamic hybrid entropy, reinforcement strategies, and automated exploration. First, we present a dynamic hybrid entropy unit and reinforcement strategies that provide sufficient randomness. On this basis, we propose a high-efficiency dynamic hybrid TRNG (DH-TRNG) architecture. It is portable across field-programmable gate arrays (FPGAs) of distinct processes and passes both the NIST and AIS-31 tests without any post-processing. Experiments show it occupies only 8 slices while achieving the highest throughput of 670 and 620 Mb/s on Xilinx Virtex-6 and Artix-7, respectively. Compared to state-of-the-art TRNGs, DH-TRNG attains the highest Throughput/(Slices·Power), a $2.63\times$ increase. In addition, we propose an automated exploration scheme as a preliminary EDA flow for TRNGs targeting resource-constrained scenarios. The scheme automatically explores TRNG designs that meet the given requirements while further reducing hardware overhead, indicating broad application prospects for TRNG design automation. Finally, we apply DH-TRNG and the results of the automated exploration to stochastic computing (SC) for edge detection, achieving promising outcomes.
{"title":"High Throughput and Compact FPGA TRNGs Based on Hybrid Entropy, Reinforcement Strategies, and Automated Exploration","authors":"Yuan Zhang;Kuncai Zhong;Jiliang Zhang","doi":"10.1109/TCAD.2025.3572838","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3572838","url":null,"abstract":"As a vital security primitive, the true random number generator (TRNG) is a mandatory component to build trust roots for any encryption system. However, existing TRNGs suffer from bottlenecks of low throughput and high area-energy consumption. Additionally, the electronic design automation (EDA) design of TRNG for specific applications remains an unexplored area. To address these issues, in this work, we propose compact and high-throughput TRNGs based on dynamic hybrid, reinforcement strategies, and automated exploration. First, we present a dynamic hybrid entropy unit and reinforcement strategies to provide sufficient randomness. On this basis, we propose a high-efficiency dynamic hybrid TRNG (DH-TRNG) architecture. It exhibits portability to distinct process field programmable gate arrays (FPGAs) and passes both NIST and AIS-31 tests without any post-processing. The experiments show it incurs only 8 slices with the highest throughput of 670 and 620 Mb/s on Xilinx Virtex-6 and Artix-7, respectively. Compared to the state-of-the-art TRNGs, DH-TRNG has the highest (Throughput/Slices·Power) with <inline-formula> <tex-math>$2.63times $ </tex-math></inline-formula> increase. In addition, we propose an automated exploration scheme as a preliminary EDA design for TRNG to better apply to resource-constrained scenarios. This scheme automatically explores TRNGs to meet the design requirements and further reduces the hardware overhead, indicating broad application prospects in TRNG automation design. Finally, we apply the proposed DH-TRNG and the results of automated exploration to stochastic computing (SC) for edge detection, achieving promising outcomes.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"45 1","pages":"519-532"},"PeriodicalIF":2.9,"publicationDate":"2025-03-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145904305","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}