
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems: Latest Publications

OpenLS-DGF: An Adaptive Open-Source Dataset Generation Framework for Machine-Learning Tasks in Logic Synthesis
IF 2.9 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-03-27 | DOI: 10.1109/TCAD.2025.3555506
Liwei Ni;Rui Wang;Miao Liu;Xingyu Meng;Xiaoze Lin;Junfeng Liu;Guojie Luo;Zhufei Chu;Weikang Qian;Xiaoyan Yang;Biwei Xie;Xingquan Li;Huawei Li
This article introduces OpenLS-DGF, an adaptive logic synthesis dataset generation framework, to enhance machine-learning (ML) applications within the logic synthesis process. Previous dataset generation flows were tailored for specific tasks or lacked integrated ML capabilities. In contrast, OpenLS-DGF supports various ML tasks by encapsulating the three fundamental steps of logic synthesis: 1) Boolean representation; 2) logic optimization; and 3) technology mapping. It preserves the original information in both Verilog and ML-friendly GraphML formats. The Verilog files offer semi-customizable capabilities, enabling researchers to insert additional steps and incrementally refine the generated dataset. Furthermore, OpenLS-DGF includes an adaptive circuit engine that facilitates the final dataset management and downstream tasks. The generated OpenLS-D-v1 dataset comprises 46 combinational designs from established benchmarks, totaling over 966,000 Boolean circuits. OpenLS-D-v1 supports integrating new data features, making it more versatile for new tasks. This article demonstrates the versatility of OpenLS-D-v1 through four distinct downstream tasks: circuit classification, circuit ranking, quality of results (QoR) prediction, and probability prediction. Each task is chosen to represent essential steps of logic synthesis, and the experimental results show that the dataset generated by OpenLS-DGF achieves prominent diversity and applicability. The source code and datasets are available at https://github.com/Logic-Factory/ACE/blob/master/OpenLS-DGF.
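Since the dataset ships circuits in an ML-friendly GraphML format, a downstream task typically begins by loading a generated graph and deriving structural features. The Python sketch below uses networkx for that purpose; the file name and the assumption of a per-circuit graph file are illustrative guesses about the dataset layout, not code from OpenLS-DGF.

```python
import networkx as nx

# Hypothetical example of consuming one generated circuit graph and computing
# simple structural features for an ML task. The file name is an assumption,
# not an actual OpenLS-D-v1 artifact.
g = nx.read_graphml("adder_aig_depth_opt.graphml")
features = {
    "num_nodes": g.number_of_nodes(),
    "num_edges": g.number_of_edges(),
    "avg_degree": sum(d for _, d in g.degree()) / g.number_of_nodes(),
}
print(features)
```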
Citations: 0
Formal Synthesis of Neural Barrier Certificates for Dynamical Systems via DC Programming
IF 2.9 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-03-27 | DOI: 10.1109/TCAD.2025.3555513
Yang Wang;Hanlong Chen;Wang Lin;Zuohua Ding
Barrier certificate generation is an ingenious and powerful approach for safety verification of cyber-physical systems. This article suggests a new learning and verification framework that helps balance representation ability and verification efficiency for neural barrier certificates. In the learning phase, it learns candidate barrier certificates represented as convex difference neural networks (CDiNNs). Since CDiNNs can be rewritten as differences of convex (DC) functions, which can express any twice-differentiable function, they offer outstanding representation ability and flexibility. In the verification phase, it employs an efficient approach for formally verifying the validity of the neural candidates via DC programming. Due to the convexity-based structure, CDiNNs can significantly facilitate the verification process. We conduct an experimental evaluation over a set of benchmarks, which validates that our method is much more efficient and effective than the state-of-the-art approaches.
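For readers unfamiliar with barrier certificates, one common formulation is sketched below in LaTeX; the precise side conditions and the DC decomposition used in the article may differ from this textbook version.

```latex
% One common barrier-certificate formulation: for a system \dot{x} = f(x),
% an initial set X_0, and an unsafe set X_u, a function B certifies safety if
\begin{align*}
  B(x) &\le 0 && \forall x \in X_0, \\
  B(x) &> 0  && \forall x \in X_u, \\
  \langle \nabla B(x),\, f(x) \rangle &\le 0 && \forall x \ \text{with} \ B(x) = 0,
\end{align*}
% so no trajectory starting in X_0 can reach X_u. A CDiNN candidate has the
% difference-of-convex form B(x) = g(x) - h(x) with g, h convex, which is the
% structure that DC programming exploits during verification.
```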
Citations: 0
InstantGR: Scalable GPU Parallelization for 3-D Global Routing
IF 2.9 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-03-26 | DOI: 10.1109/TCAD.2025.3573685
Liang Xiao;Shiju Lin;Jinwei Liu;Qinkai Duan;Tsung-Yi Ho;Evangeline F. Y. Young
Global routing plays a crucial role in electronic design automation (EDA), serving not only as a means of optimizing routing but also as a tool for estimating routability in earlier stages, such as logic synthesis and physical planning. However, these scenarios often require global routing on unpartitioned large designs, posing unique challenges in scalability, both in terms of runtime and design size. To tackle this issue, this article introduces useful techniques for parallelizing large-scale global routing that can significantly increase parallelism and thus reduce runtime. We also propose a new flexible layer transition technique to increase the flexibility and routing quality of directed acyclic graph (DAG) routing. Building upon these techniques, we have developed an open-source GPU-based global router that achieves state-of-the-art results in the latest ISPD’24 Contest benchmarks, thereby showcasing the effectiveness of our methods.
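As a toy illustration of the routability feedback mentioned above, a global router typically reports per-g-cell overflow, i.e., routing demand that exceeds the available capacity. The grid and numbers below are invented for illustration and this is not InstantGR's algorithm.

```python
import numpy as np

# Toy congestion estimate on a 3x3 routing grid: overflow per g-cell is
# max(0, demand - capacity); the values are made up for illustration.
capacity = np.full((3, 3), 4)
demand = np.array([[2, 5, 3],
                   [4, 6, 1],
                   [0, 3, 2]])
overflow = np.maximum(demand - capacity, 0)
print("total overflow:", int(overflow.sum()))  # 3
```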
Citations: 0
MCHEAS: Optimizing Large-Parameter NTT Over Multicluster In-Situ FHE Accelerating System
IF 2.9 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-03-26 | DOI: 10.1109/TCAD.2025.3555191
Zhenyu Guan;Yongqing Zhu;Luchang Lei;Hongyang Jia;Yi Chen;Bo Zhang;Changrui Ren;Jin Dong;Song Bian
Fully homomorphic encryption (FHE) enables high-level security but with a heavy computation workload, necessitating software-hardware co-design for aggressive acceleration. Recent works on specialized accelerators for HE evaluation have made significant progress in supporting lightweight RNS-CKKS applications, especially those with high-density in-memory computing techniques. To fulfill higher computational demands for more general applications, this article proposes multicluster HE accelerating system (MCHEAS), an accelerating system comprising multiple in-situ HE processing accelerators, each functioning as a cluster to perform large-parameter RNS-CKKS evaluation collaboratively. MCHEAS features optimization strategies including the synchronous, preemptive swap, square-diagonal, and odd-even index separation. Using these strategies to compile the computation and transmission of number theoretic transform (NTT) coefficients, the method optimizes the intercluster data swaps, a major bottleneck in NTT computations. Evaluations show that under 1 GHz, with different intercluster data transfer bandwidths, our approach accelerates NTT computations by 26.40% to 51.75%. MCHEAS also improves computing unit utilization by 10.30% to 33.97%, with a maximum peak utilization rate of up to 99.62%. MCHEAS achieves 17.63% to 34.67% speedups for HE operations involving NTT, and 15.12% to 30.62% speedups for demonstrated applications, while enhancing the computing units’ utilization by 5.18% to 21.87% during application execution. Furthermore, we compare MCHEAS with SOTA designs under a specific intercluster data transfer bandwidth, achieving up to 81.45× their area efficiencies in applications.
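The NTT kernel that MCHEAS optimizes is, at its core, a discrete Fourier transform over a prime field. A naive reference version is sketched below in Python; the tiny parameters are illustrative only and unrelated to real RNS-CKKS parameter sets.

```python
# Naive O(n^2) forward number theoretic transform (NTT) modulo a prime q,
# where `root` is a primitive n-th root of unity mod q (n = len(a)).
def ntt(a, q, root):
    n = len(a)
    return [sum(a[j] * pow(root, i * j, q) for j in range(n)) % q
            for i in range(n)]

# Example with n = 4, q = 17: 4 has multiplicative order 4 mod 17 (4, 16, 13, 1).
print(ntt([1, 2, 3, 4], q=17, root=4))
```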
Citations: 0
GRACE: An End-to-End Graph Processing Accelerator on FPGA With Graph Reordering Engine
IF 2.9 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-03-26 | DOI: 10.1109/TCAD.2025.3555192
Haishuang Fan;Rui Meng;Qichu Sun;Jingya Wu;Wenyan Lu;Xiaowei Li;Guihai Yan
Graphs play an important role in various applications. With the rapid expansion of vertices in real life, existing large-scale graph processing frameworks on CPUs and GPUs encounter challenges in optimizing cache usage due to irregular memory access patterns. To address this, graph reordering has been proposed to improve the locality of the graph, but it introduces significant overhead without delivering substantial end-to-end performance improvement. While there have been many FPGA-based accelerators for graph processing, achieving high throughput often requires complex graph preprocessing on CPUs. Therefore, implementing an efficient end-to-end graph processing system remains challenging. This article introduces GRACE, an end-to-end FPGA-based graph processing accelerator with a graph reordering engine and a pull-based vertex-centric programming model (PL-VCPM) engine. First, GRACE employs a customized high-degree vertex cache (HDC) to improve memory access efficiency. Second, GRACE offloads the graph preprocessing to FPGA. We customize an efficient graph reordering engine to complete preprocessing. Third, GRACE adopts a graph pruning strategy to remove the activation and computation redundancy in graph processing. Finally, GRACE introduces a graph conflict board (GCB) to resolve data conflicts and a multiport cache to enhance parallel efficiency. Experimental results demonstrate that GRACE achieves 7.1× end-to-end performance speedup over CPU and 1.8× over GPU, as well as 27.3× and 8.7× energy efficiency over CPU and GPU, respectively. Moreover, GRACE delivers up to 34.9× performance speedup compared to the state-of-the-art FPGA accelerator.
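As a software stand-in for the reordering idea (the article implements it as an FPGA engine), the sketch below renumbers vertices so that high-degree vertices receive the smallest new IDs, which tends to cluster their accesses in fast memory. This is a generic degree-based heuristic, not GRACE's exact scheme.

```python
# Degree-based graph reordering: high-degree vertices are renumbered first so
# that their frequently accessed data ends up close together in memory.
def degree_reorder(adj):
    """adj: dict vertex -> list of neighbours. Returns old_id -> new_id map."""
    order = sorted(adj, key=lambda v: len(adj[v]), reverse=True)
    return {old: new for new, old in enumerate(order)}

adj = {0: [1, 2, 3], 1: [0], 2: [0, 3], 3: [0, 2]}
print(degree_reorder(adj))  # vertex 0 (degree 3) receives new id 0
```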
Citations: 0
GPOS: A General and Precise Offloading Strategy for High Generality of DNN Acceleration by OCP and NDP Co-Optimizing
IF 2.9 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-03-26 | DOI: 10.1109/TCAD.2025.3555184
Zixu Li;Wang Wang;Manni Li;Jiayu Yang;Zijian Huang;Xin Zhong;Yinyin Lin;Chengchen Wang;Xiankui Xiong
The arithmetic intensity (ArI) of different DNNs can be opposite. This challenges the generality of single acceleration architectures, including both dedicated on-chip processing (OCP) and near-data processing (NDP). Neither architecture can simultaneously achieve optimal energy efficiency and performance for operators with opposite ArI. It is relatively straightforward to think of combining the respective advantages of OCP and NDP. However, few publications have addressed their real-time co-optimization, primarily due to the lack of a quantifiable offloading method. Here, we propose GPOS, a general and precise offloading strategy that supports high generality of DNN acceleration. GPOS comprehensively considers the complex interactions between OCP and NDP, including hardware configurations, dataflow (DF), DNN model, and interdie data movements (DMs). Three quantifiable indicators—ArI, execution cost (Ex-cost), and DM-cost—are employed to precisely evaluate the impacts of these interactions on energy and latency. GPOS adopts a four-step flow with progressive refinement: each of the first three steps focuses on a single indicator at the operator level, while the final step performs context-based calibration to address operator interdependencies and avoid offsetting NDP benefits. Narrowing down offloading candidates in step 1 and step 3 significantly accelerates real-time quantitative analysis. Optimized mapping techniques and NDP-input stationary DF are proposed to reduce Ex-cost and extend operator types supported by NDP. Next, for the first time, sparsity—one of the most popular methods for energy optimization that can alter data reuse or ArI—is quantitatively investigated for its impacts on offloading using GPOS. Our evaluations include representative DNNs, including GPT-2, Bert, RNN, CNN, and MLP. GPOS achieves the minimum energy and latency for each benchmark, with geometric mean speedups of 49.0% and 94.1%, and geometric mean energy savings of 45.8% and 89.2% over All-OCP and All-NDP, respectively. GPOS also reduces offloading analysis latency by a geometric mean of 92.7% compared to the evaluation that traverses each operator and its relative combinations. On average, sparsity further improves performance and energy efficiency by increasing the number of operators offloaded to NDP. However, for DNNs where all operators exhibit either very high or very low ArI, the number of offloaded operators remains unchanged, even after sparsity is applied.
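To make the ArI-based filtering concrete, the sketch below flags operators whose arithmetic intensity (FLOPs per byte moved) falls below a threshold as NDP offloading candidates. The threshold and the operator numbers are assumptions for illustration, not GPOS's calibrated values.

```python
# Coarse ArI filter: low arithmetic intensity suggests a memory-bound operator
# that may benefit from near-data processing (NDP).
def offload_candidates(operators, threshold=1.0):
    return [name for name, (flops, bytes_moved) in operators.items()
            if flops / bytes_moved < threshold]

ops = {
    "gemm":      (2.0e9, 1.6e7),   # compute-bound: ~125 FLOPs per byte
    "embedding": (1.0e6, 4.0e7),   # memory-bound: ~0.025 FLOPs per byte
}
print(offload_candidates(ops))  # ['embedding']
```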
Citations: 0
Toward Fast Heterogeneous Virtual Prototypes: Increasing the Solver Efficiency in SystemC AMS
IF 2.9 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-03-25 | DOI: 10.1109/TCAD.2025.3554612
Alexandra Küster;Rainer Dorsch;Christian Haubelt
The development of modern heterogeneous systems requires early integration of the various domains to improve and verify the design. Heterogeneous virtual prototypes are a key enabler to reach this goal. In order to efficiently support the development, their high simulation speed is of utmost importance. This article introduces measures to speed up SystemC analog/mixed-signal (AMS) simulations, which are commonly used to simulate the AMS part jointly with the digital prototype in SystemC. Two approaches to integrate variable-step ordinary differential equation solvers into the simulation semantics of SystemC AMS are presented. Both of them avoid global backtracking. One is well suited for feedback loops and the other is favorable for systems dynamically reacting to events. Moreover, a timestep quantization is developed that overcomes the recurrent matrix inversion bottleneck of variable-step implicit solvers. A similar method is then used to increase the simulation speed of electrical linear network models with high switching activity. Various experiments from the context of smart sensors are presented which prove the effectiveness for enhancing the simulation speed.
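The variable-step idea can be illustrated with a tiny embedded Euler/Heun integrator that grows or shrinks the timestep from a local error estimate. This is a generic sketch under assumed tolerances and a made-up test system, not the SystemC AMS solver itself.

```python
# One accepted variable step for y' = f(t, y): compare a 1st-order (Euler) and
# a 2nd-order (Heun) estimate, accept if their difference is within tolerance,
# and adapt the step size for the next step.
def adaptive_step(f, t, y, h, tol=1e-6):
    while True:
        euler = y + h * f(t, y)
        heun = y + 0.5 * h * (f(t, y) + f(t + h, euler))
        err = abs(heun - euler)
        if err <= tol:
            growth = min(2.0, max(0.5, 0.9 * (tol / max(err, 1e-15)) ** 0.5))
            return t + h, heun, h * growth
        h *= 0.5  # reject the step and retry with a smaller one

t, y, h = 0.0, 1.0, 0.1
while t < 1.0:
    t, y, h = adaptive_step(lambda t, y: -5.0 * y, t, y, min(h, 1.0 - t))
print(y)  # close to exp(-5) ≈ 0.00674
```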
Citations: 0
PTPS: Precision-Aware Task Partitioning and Scheduling for SpMV on CPU-FPGA Heterogeneous Platforms
IF 2.9 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-03-24 | DOI: 10.1109/TCAD.2025.3554144
Jianhua Gao;Zhi Zhou;Xingze Huang;Juan Wang;Yizhuo Wang;Weixing Ji
The CPU-FPGA heterogeneous computing architecture is extensively employed in the embedded domain due to its low cost and power efficiency, with numerous sparse matrix-vector multiplication (SpMV) acceleration efforts already targeting this architecture. However, existing work rarely includes collaborative SpMV computations between CPU and FPGA, which limits the exploration of hybrid architectures that could potentially offer enhanced performance and flexibility. This article introduces an FPGA architecture design that supports multiprecision SpMV computations, including FP16, FP32, and FP64. Building on this, PTPS, a precision-aware SpMV task partitioning and dynamic scheduling algorithm tailored for the CPU-FPGA heterogeneous architecture, is proposed. The core idea of PTPS is lossless partitioning of sparse matrices across multiple precisions, prioritizing low-precision SpMV computations on the FPGA and high-precision computations on the CPU. PTPS not only leverages the strengths of CPU and FPGA for collaborative SpMV computations but also reduces data transmission overhead between them, thereby improving the overall computational efficiency. Experimental evaluation demonstrates that the proposed approach offers an average speedup of 1.57× over the CPU-only approach and 2.58× over the FPGA-only approach.
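The lossless-partitioning idea can be pictured as routing each nonzero to the lowest precision that represents it exactly. The round-trip criterion in the sketch below is an assumption for illustration and not necessarily the exact rule PTPS uses.

```python
import numpy as np

# Route each value to the narrowest floating-point format that round-trips it
# exactly, so mixed-precision SpMV loses no accuracy on those values.
def partition_by_precision(values):
    buckets = {"fp16": [], "fp32": [], "fp64": []}
    for v in values:
        if float(np.float16(v)) == v:
            buckets["fp16"].append(v)
        elif float(np.float32(v)) == v:
            buckets["fp32"].append(v)
        else:
            buckets["fp64"].append(v)
    return buckets

print(partition_by_precision([0.5, 1.0 + 2**-20, 1.0 / 3.0]))
# 0.5 -> fp16, 1 + 2**-20 -> fp32, 1/3 -> fp64
```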
Citations: 0
Modular Functional Test Sequences for Test Compaction
IF 2.9 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-03-23 | DOI: 10.1109/TCAD.2025.3573223
Irith Pomeranz
Ensuring correct functional operation of a chip requires extensive testing. Without the constraints of maintaining functional operation conditions, structural (scan-based) tests allow high fault coverage to be achieved efficiently. To cover defects that are only exhibited under functional operation conditions, functional test sequences are used for complementing scan-based tests. One of the limitations of functional test sequences is their length, making it important to apply test compaction. To avoid losing the functional properties of a sequence when test compaction is applied at the gate level, design-for-testability (DFT) logic can be used for keeping the circuit in its functional state space. In this context, this article suggests the new concept of a modular functional test sequence consisting of subsequences that can be plugged in or out to increase the fault coverage or reduce the sequence length. To support modularity at the gate level, DFT logic is used for restoring functional states between subsequences. Modularity offers the key advantage that a single compact functional test sequence can be constructed from a given pool of functional test sequences, and the modular sequence can be updated as additional sequences become available in the pool, or additional fault models are targeted. The article develops a procedure for the generation and compaction of modular sequences using subsequences from a given pool, and presents experimental results for benchmark circuits in an academic simulation environment to demonstrate its effectiveness and limitations.
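A simple way to picture building a compact modular sequence from a pool is greedy selection by incremental fault coverage, as sketched below. The fault sets are hypothetical and this is only a stand-in for the article's actual compaction procedure.

```python
# Greedy selection of subsequences: repeatedly plug in the subsequence that
# detects the most still-undetected faults, and stop when nothing new is added.
def greedy_compact(pool):
    """pool: dict subsequence_name -> set of detected faults."""
    covered, selected = set(), []
    while True:
        best = max(pool, key=lambda s: len(pool[s] - covered))
        gain = pool[best] - covered
        if not gain:
            break
        selected.append(best)
        covered |= gain
    return selected

pool = {"seqA": {1, 2, 3}, "seqB": {3, 4}, "seqC": {2, 3, 4, 5}}
print(greedy_compact(pool))  # ['seqC', 'seqA'] covers faults 1..5
```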
Citations: 0
High Throughput and Compact FPGA TRNGs Based on Hybrid Entropy, Reinforcement Strategies, and Automated Exploration
IF 2.9 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-03-22 | DOI: 10.1109/TCAD.2025.3572838
Yuan Zhang;Kuncai Zhong;Jiliang Zhang
As a vital security primitive, the true random number generator (TRNG) is a mandatory component to build trust roots for any encryption system. However, existing TRNGs suffer from bottlenecks of low throughput and high area-energy consumption. Additionally, the electronic design automation (EDA) design of TRNG for specific applications remains an unexplored area. To address these issues, in this work, we propose compact and high-throughput TRNGs based on dynamic hybrid entropy, reinforcement strategies, and automated exploration. First, we present a dynamic hybrid entropy unit and reinforcement strategies to provide sufficient randomness. On this basis, we propose a high-efficiency dynamic hybrid TRNG (DH-TRNG) architecture. It exhibits portability to distinct process field programmable gate arrays (FPGAs) and passes both NIST and AIS-31 tests without any post-processing. The experiments show it occupies only 8 slices with the highest throughput of 670 and 620 Mb/s on Xilinx Virtex-6 and Artix-7, respectively. Compared to the state-of-the-art TRNGs, DH-TRNG has the highest (Throughput/Slices·Power) with a 2.63× increase. In addition, we propose an automated exploration scheme as a preliminary EDA design for TRNG to better apply to resource-constrained scenarios. This scheme automatically explores TRNGs to meet the design requirements and further reduces the hardware overhead, indicating broad application prospects in TRNG automation design. Finally, we apply the proposed DH-TRNG and the results of automated exploration to stochastic computing (SC) for edge detection, achieving promising outcomes.
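To give a flavor of the statistical testing such generators must pass, the sketch below implements the NIST SP 800-22 monobit (frequency) test. The sample bits are made up for illustration and are not DH-TRNG output.

```python
import math

# Monobit (frequency) test: map bits to +/-1 and check that the normalized sum
# is consistent with an unbiased source (p-value >= 0.01 passes).
def monobit_p_value(bits):
    s = sum(1 if b else -1 for b in bits)
    s_obs = abs(s) / math.sqrt(len(bits))
    return math.erfc(s_obs / math.sqrt(2))

bits = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0] * 10  # illustrative, balanced pattern
print(monobit_p_value(bits) >= 0.01)  # True: no evidence of bias
```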
Citations: 0