
Latest Publications: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

An Optimization-Aware Prerouting Timing Prediction Framework Based on Multimodal Learning
IF 2.9 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Vol. 44, no. 10, pp. 3896-3909 | Pub Date: 2025-03-04 | DOI: 10.1109/TCAD.2025.3547806
Peng Cao;Yusen Qin;Guoqing He;Wenjie Ding;Xu Cheng;Zhanhua Zhang;Yuyang Ye
Accurate and efficient prerouting timing estimation is particularly crucial during placement to alleviate time-consuming design iterations. Machine-learning (ML)-based methods have been introduced recently to predict post-routing timing results at the placement stage, but most of them neglect the impact of timing optimization during physical design and suffer accuracy loss due to inconsistent circuit netlists. In this work, an optimization-aware prerouting timing prediction framework based on multimodal learning is proposed to calibrate the timing changes between the placement and routing stages, where the local netlist and layout information are extracted by a graph neural network (GNN) and a convolutional neural network (CNN), respectively, while the global information along the path is further extracted by a Transformer network. Based on the post-routing timing results predicted by the proposed framework, timing optimization guidance is generated to enhance the traditional design flow with better physical implementation quality. Experimental results demonstrate that for the OpenCores benchmark circuits under a TSMC 22-nm process, the proposed framework achieves significant correlation and accuracy improvements, with an average $R^2$ score of 0.9219 and a mean absolute percentage error (MAPE) of 2.12%, as well as an average runtime acceleration of $645\times$ compared with the traditional design flow on testing designs. With the timing optimization guidance, significant worst negative slack (WNS) and total negative slack (TNS) improvements are achieved compared with the traditional flow after placement and routing, respectively, without noticeable increases in area, power, wire length, or the number of design rule check (DRC) violations.
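The reported $R^2$ and MAPE quantify how closely the predicted post-routing timing tracks ground truth. For reference, a minimal Python sketch of how these two metrics are computed; the sample arrays are hypothetical placeholders, not data from the paper:

```python
import numpy as np

def r2_score(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def mape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean absolute percentage error, in percent."""
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

# Hypothetical arrival-time samples (ns): post-routing truth vs. prediction.
truth = np.array([1.20, 0.95, 1.43, 1.10, 0.88])
pred = np.array([1.18, 0.97, 1.40, 1.13, 0.90])
print(f"R2 = {r2_score(truth, pred):.4f}, MAPE = {mape(truth, pred):.2f}%")
```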
Citations: 0
HaloFL: Efficient Heterogeneity-Aware Federated Learning Through Optimal Submodel Extraction and Dynamic Sparse Adjustment
IF 2.9 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Vol. 44, no. 9, pp. 3518-3531 | Pub Date: 2025-03-04 | DOI: 10.1109/TCAD.2025.3548003
Zirui Lian;Qianyue Cao;Chao Liang;Jing Cao;Zongwei Zhu;Zhi Yang;Cheng Ji;Changlong Li;Xuehai Zhou
Federated learning (FL) is an advanced framework that enables collaborative training of machine learning models across edge devices. An effective strategy to enhance training efficiency is to allocate the optimal submodel based on each device's resource capabilities. However, system heterogeneity significantly increases the difficulty of allocating submodel parameter budgets appropriately for each device, leading to the straggler problem. Meanwhile, data heterogeneity complicates the selection of the optimal submodel structure for specific devices, thereby impacting training performance. Furthermore, the dynamic nature of edge environments, such as fluctuations in network communication and computational resources, exacerbates these challenges, making it even more difficult to precisely extract appropriately sized and structured submodels from the global model. To address the challenges in heterogeneous training environments, we propose an efficient FL framework, namely, HaloFL. The framework dynamically adjusts the structure and parameter budget of submodels during training by evaluating three dimensions: 1) model-wise performance; 2) layer-wise performance; and 3) unit-wise performance. First, we design a data-aware model unit importance evaluation method to determine the optimal submodel structure for different data distributions. Next, using this evaluation method, we analyze the importance of model layers and reallocate parameters from noncritical layers to critical layers within a fixed parameter budget, further optimizing the submodel structure. Finally, we introduce a resource-aware dual-UCB multiarmed bandit agent, which dynamically adjusts the total parameter budget of submodels according to changes in the training environment, allowing the framework to better adapt to the performance differences of heterogeneous devices. Experimental results demonstrate that HaloFL exhibits outstanding efficiency in various dynamic and heterogeneous scenarios, achieving up to a 14.80% improvement in accuracy and a $3.06\times$ speedup compared to existing FL frameworks.
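The resource-aware agent is described as a dual-UCB multiarmed bandit that tunes the total submodel parameter budget as conditions change. The abstract does not spell out the reward design or the dual-UCB construction, so the Python sketch below shows only a generic single-UCB (UCB1) loop over hypothetical budget levels with a stand-in reward, to make the selection mechanics concrete:

```python
import math
import random

budgets = [0.25, 0.5, 0.75, 1.0]  # hypothetical submodel parameter ratios

def reward(budget: float) -> float:
    """Stand-in for measured per-round training progress: larger submodels
    help until the (hypothetical) device capacity is exceeded."""
    capacity = 0.6
    return budget - 2.0 * max(0.0, budget - capacity) + random.gauss(0, 0.05)

counts = [0] * len(budgets)
totals = [0.0] * len(budgets)

for t in range(1, 201):
    # UCB1: play each arm once, then maximize mean reward + exploration bonus.
    if 0 in counts:
        arm = counts.index(0)
    else:
        arm = max(range(len(budgets)),
                  key=lambda a: totals[a] / counts[a]
                  + math.sqrt(2 * math.log(t) / counts[a]))
    counts[arm] += 1
    totals[arm] += reward(budgets[arm])

best = max(range(len(budgets)), key=lambda a: totals[a] / counts[a])
print("selected budget:", budgets[best], "pulls per arm:", counts)
```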
Citations: 0
Circuit Partitioning and Transmission Cost Optimization in Distributed Quantum Circuits
IF 2.9 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Vol. 44, no. 9, pp. 3350-3362 | Pub Date: 2025-03-04 | DOI: 10.1109/TCAD.2025.3547812
Xinyu Chen;Zilu Chen;Pengcheng Zhu;Xueyun Cheng;Zhijin Guan
Given the limitations on the number of qubits in current noisy intermediate-scale quantum (NISQ) devices, the implementation of large-scale quantum algorithms on such devices is challenging, prompting research into distributed quantum computing. This article focuses on the issue of excessive communication complexity in distributed quantum computing based on the quantum circuit model. To reduce the number of quantum state transmissions, i.e., the transmission cost, in distributed quantum circuits, a circuit partitioning method based on the quadratic unconstrained binary optimization (QUBO) model is proposed, coupled with the lookahead method for transmission cost optimization. Initially, the problem of distributed quantum circuit partitioning is transformed into a graph minimum cut problem. The QUBO model, which can be accelerated by quantum annealing algorithms, is introduced to minimize the number of quantum gates between quantum processing units (QPUs) and the transmission cost. Subsequently, the dynamic lookahead strategy for the selection of transmission qubits is proposed to optimize the transmission cost in distributed quantum circuits. Finally, through numerical simulations, the impact of different circuit partitioning indicators on the transmission cost is explored, and the proposed method is evaluated on benchmark circuits. Experimental results demonstrate that the proposed circuit partitioning method has a shorter runtime compared with current circuit partitioning methods. Additionally, the transmission cost optimized by the proposed method is significantly lower than that of current transmission cost optimization methods, achieving noticeable improvements across different numbers of partitions.
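The reduction from circuit partitioning to a QUBO works by letting a binary variable $x_i$ assign qubit $i$ to one of two QPUs: an edge of weight $w$ (e.g., the number of gates spanning qubits $i$ and $j$) contributes $w(x_i + x_j - 2x_ix_j)$, which equals $w$ exactly when the edge is cut. A brute-force toy sketch in Python; the graph, the penalty weight, and the balance term (a common addition, not spelled out in the abstract) are illustrative, and a real flow would hand the $Q$ matrix to a quantum annealer or QUBO solver:

```python
import itertools
import numpy as np

# Toy coupling graph: edge weight = number of gates between the two qubits.
edges = {(0, 1): 2, (1, 2): 1, (2, 3): 2, (3, 0): 1, (0, 2): 1}
n, lam = 4, 2.0  # qubit count; balance-penalty weight (chosen ad hoc)

Q = np.zeros((n, n))
for (i, j), w in edges.items():  # cut cost: w * (x_i + x_j - 2 x_i x_j)
    Q[i, i] += w
    Q[j, j] += w
    Q[i, j] -= 2 * w
for i in range(n):  # balance: lam * (sum_i x_i - n/2)^2, constant dropped
    Q[i, i] += lam * (1 - n)
    for j in range(i + 1, n):
        Q[i, j] += 2 * lam

x_best = min(itertools.product([0, 1], repeat=n),
             key=lambda x: np.array(x) @ Q @ np.array(x))
print("QPU assignment per qubit:", x_best)
```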
Citations: 0
Advancing Neuromorphic Architecture Toward Emerging Spiking Neural Network on FPGA
IF 2.9 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Vol. 44, no. 9, pp. 3465-3478 | Pub Date: 2025-03-03 | DOI: 10.1109/TCAD.2025.3547275
Yingxue Gao;Teng Wang;Yang Yang;Lei Gong;Xianglan Chen;Chao Wang;Xi Li;Xuehai Zhou
Spiking neural networks (SNNs) replace the multiply-and-accumulate operations in traditional artificial neural networks (ANNs) with lightweight mask-and-accumulate operations, achieving greater performance. Existing SNN architectures are primarily designed based on fully-connected or convolutional SNN topologies and still struggle with low task accuracy, limiting their practical applications. Recently, transformer SNN (TSNN) models have shown promise in matching the accuracy of nonspiking ANNs and demonstrated potential application prospects. However, their diverse computation patterns and sophisticated network structure with high computation and memory footprints impede their efficient deployment. Thus, in this work, we move our attention to heterogeneous architecture design and propose SpikeTA, the first neuromorphic hardware accelerator explicitly designed for the TSNN model on FPGA. First, SpikeTA enables parameterizable hardware engines (HEs) designed for the network layers in TSNN, enhancing compatibility between HEs and network layers. Second, SpikeTA optimizes arithmetic operations between binary spikes and synaptic weights by presenting a DSP-efficient addition tree. By analyzing the inherent data characteristics, SpikeTA further introduces a depth-aware buffer management strategy to provide sufficient access ports. Third, SpikeTA employs a streaming dataflow mapping to optimize data transmission granularity and leverages a split-engine dataflow mapping to facilitate pipelined latency balancing. Experimental results demonstrate that SpikeTA achieves significant performance speedups of $140.73\times$–$1023.53\times$ and $2.97\times$–$7.29\times$ over architectures running on the AMD EPYC 7542 CPU and NVIDIA A100 GPU, respectively. SpikeTA also outperforms state-of-the-art SNN and Transformer accelerators by $2.79\times$ and $2.66\times$ in architecture performance while achieving a peak performance of 28.99 TOPs.
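The mask-and-accumulate point in the opening sentence is worth making concrete: because SNN activations are binary spikes, each output is just the sum of the weights at positions where a spike fired, so the multiplier drops out entirely (the addition tree SpikeTA builds in DSPs sums exactly such selected weights). A small numpy illustration with arbitrary shapes and values:

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 8)).astype(np.float32)  # 4 neurons, 8 inputs
spikes = rng.integers(0, 2, size=8).astype(bool)      # binary input spikes

# ANN-style multiply-and-accumulate (8 multiplies per neuron):
mac_out = weights @ spikes.astype(np.float32)

# SNN-style mask-and-accumulate: select weights where a spike fired, then add.
masked_out = weights[:, spikes].sum(axis=1)

assert np.allclose(mac_out, masked_out)  # identical result, no multiplies
print(masked_out)
```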
Citations: 0
Theseus: Exploring Efficient Wafer-Scale Chip Design for Large Language Models
IF 2.9 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Vol. 44, no. 12, pp. 4793-4806 | Pub Date: 2025-03-02 | DOI: 10.1109/TCAD.2025.3566297
Jingchen Zhu;Chenhao Xue;Yiqi Chen;Zhao Wang;Chen Zhang;Yu Shen;Yifan Chen;Zekang Cheng;Yu Jiang;Tianqi Wang;Yibo Lin;Wei Hu;Bin Cui;Runsheng Wang;Yun Liang;Guangyu Sun
The emergence of large language models (LLMs) has driven an exponential growth in demand for computation throughput, memory capacity, and communication bandwidth. This demand growth has significantly outpaced the improvement of corresponding chip designs. With the advancement of fabrication and integration technologies, designers have been developing wafer-scale chips (WSCs) to scale up and exploit the limits of computation density, memory capacity, and communication bandwidth at the level of a single chip. Existing solutions have demonstrated the significant advantages of WSCs over traditional designs, showing potential to effectively support LLM workloads. Despite the benefits, exploring the early-stage design space of WSCs for LLMs is a crucial yet challenging task due to the enormous and complicated design space, time-consuming evaluation methods, and inefficient exploration strategies. To address these challenges, we propose Theseus, an efficient WSC design space exploration framework for LLMs. We construct the design space of WSCs with various constraints considering the unique characteristics of WSCs. We propose efficient evaluation methodologies for large-scale NoC-based WSCs and introduce multifidelity Bayesian optimization to efficiently explore the design space. Evaluation results demonstrate the efficiency of Theseus: the searched Pareto-optimal results outperform GPU clusters and existing WSC designs by up to 62.8%/73.7% in performance (at the same or lower power) and 38.6%/42.4% in power consumption (at the same or higher performance) for LLM training, while improving the performance and power of inference tasks by up to $23.2\times$ and $15.7\times$, respectively. Furthermore, we conduct case studies to address the design tradeoffs in WSCs and provide insights to facilitate WSC designs for LLMs.
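Multifidelity Bayesian optimization is the exploration engine named in the abstract. The sketch below is a deliberately simplified single-fidelity loop (Gaussian-process surrogate plus expected improvement over one design knob) using scikit-learn; the cost function, search range, and iteration budget are all stand-ins rather than anything from Theseus:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def cost(x):
    """Stand-in for an expensive WSC evaluation (e.g., simulated latency)."""
    return np.sin(3 * x) + 0.5 * x ** 2

X = np.array([[0.2], [1.5], [2.8]])  # initial design points (1-D knob)
y = cost(X).ravel()
cand = np.linspace(0.0, 3.0, 300).reshape(-1, 1)

for _ in range(10):
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5),
                                  normalize_y=True).fit(X, y)
    mu, sigma = gp.predict(cand, return_std=True)
    best = y.min()
    z = (best - mu) / np.maximum(sigma, 1e-9)
    ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)  # expected improvement
    x_next = cand[np.argmax(ei)]
    X = np.vstack([X, [x_next]])
    y = np.append(y, cost(x_next))

print("best design:", X[np.argmin(y)].item(), "cost:", y.min())
```

A multifidelity variant would additionally decide, at each step, whether to pay for a cheap low-accuracy evaluation or an expensive high-accuracy one.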
Citations: 0
AccSiM: State-Aware Simulation Acceleration for Simulink Models
IF 2.9 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Vol. 44, no. 9, pp. 3289-3302 | Pub Date: 2025-02-28 | DOI: 10.1109/TCAD.2025.3546879
Yifan Cheng;Zehong Yu;Zhuo Su;Ting Chen;Xiaosong Zhang;Yu Jiang
Simulink has been widely used in embedded software development, where it supports simulation to validate the correctness of models. However, as the scale and complexity of models in industrial applications grow, it is time-consuming for the simulation engine of Simulink to achieve high coverage and detect potential errors, especially accumulative errors. In this article, we propose AccSiM, a method that accelerates Simulink model simulation via code generation. AccSiM generates simulation functionality code for Simulink models through simulation-oriented instrumentation, including runtime data collection, data diagnosis, and state-aware acceleration. The final simulation code is constructed by composing all the instrumentation code with actor code generated from a predefined template library and integrating test-case import. After compiling and executing the code, AccSiM generates simulation results including coverage and diagnostic information. We implemented AccSiM and evaluated it on several benchmark Simulink models. Compared to Simulink's simulation engine, AccSiM shows a $215.3\times$ improvement in simulation efficiency, significantly reducing the time required to detect errors. Furthermore, through the state-aware acceleration method, AccSiM yielded an additional $2.8\times$ speedup. AccSiM also achieved greater coverage within equivalent time.
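The pipeline of generating actor code from templates and weaving in instrumentation can be pictured with a toy example. The Python sketch below emits a hypothetical integrator actor with a coverage counter inserted before each statement, then compiles and runs it; AccSiM's actual templates, instrumentation points, and target language are not described at this level of detail in the abstract:

```python
coverage = {}

def instrument(actor_name: str, body_lines: list[str]) -> str:
    """Emit actor code with a coverage counter before each statement."""
    out = [f"def {actor_name}(state, u):"]
    for i, line in enumerate(body_lines):
        out.append(f"    coverage[('{actor_name}', {i})] = "
                   f"coverage.get(('{actor_name}', {i}), 0) + 1")
        out.append(f"    {line}")
    out.append("    return state")
    return "\n".join(out)

# Hypothetical integrator actor, as a template library might produce it.
src = instrument("integrator", ["state = state + 0.01 * u"])
exec(src)  # 'compile' the generated simulation code into this module

state = 0.0
for step in range(100):  # run an imported test case: constant input u = 1.0
    state = integrator(state, u=1.0)
print(state, coverage)
```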
Citations: 0
HotReRAM: A Performance-Power-Thermal Simulation Framework for ReRAM-Based Caches
IF 2.9 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Vol. 44, no. 9, pp. 3363-3368 | Pub Date: 2025-02-27 | DOI: 10.1109/TCAD.2025.3546855
Shounak Chakraborty;Thanasin Bunnam;Jedsada Arunruerk;Sukarn Agarwal;Shengqi Yu;Rishad Shafik;Magnus Själander
This article proposes a comprehensive thermal modeling and simulation framework, HotReRAM, for resistive RAM (ReRAM)-based caches that is verified against a memristor circuit-level model. The simulation is driven by power traces based on cache accesses for detailed temperature modeling over time. HotReRAM models power at a fine-grain level and generates temperature traces for different cache regions together with detailed analyses of thermal stability, retention time and write latency. Combining HotReRAM with gem5, a full-system simulator, and NVSim, a power simulator, for ReRAM enables temporal and spatial modeling of crucial ReRAM characteristics. This integration allows designers and architects to analyze various cache characteristics within a single cache bank and address thermal-induced issues when designing ReRAM caches. Our simulation results for an 8-MiB ReRAM cache show that the spatial thermal variance can be as high as 7 K for a single cache bank, whereas the temporal thermal variance is more than 40 K. Such temperature variances impact retention time with a standard deviation of 3.9–10.2 for a set of benchmark applications, where the write latency can increase by up to 14.5%.
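The link between the simulated temperature maps and retention time is commonly modeled with an Arrhenius-type law, $t_{ret} = t_0 \exp(E_a/(k_B T))$, under which even a few kelvin of spatial variance shifts retention noticeably. A small numeric sketch; the activation energy and prefactor are illustrative values, not HotReRAM's calibrated parameters:

```python
import numpy as np

K_B = 8.617e-5  # Boltzmann constant, eV/K
E_A = 1.0       # illustrative activation energy, eV
T0 = 1e-9       # illustrative attempt-time prefactor, s

def retention_s(temp_k: np.ndarray) -> np.ndarray:
    """Arrhenius retention model: t = t0 * exp(Ea / (kB * T))."""
    return T0 * np.exp(E_A / (K_B * temp_k))

# Hypothetical temperatures across one bank: cool corner to 7 K hotter spot.
temps = np.array([350.0, 353.5, 357.0])
for t, r in zip(temps, retention_s(temps)):
    print(f"{t:6.1f} K -> retention {r:.3e} s")
```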
Citations: 0
SiHGNN: Leveraging Properties of Semantic Graphs for Efficient HGNN Acceleration
IF 2.9 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Vol. 44, no. 9, pp. 3490-3503 | Pub Date: 2025-02-27 | DOI: 10.1109/TCAD.2025.3546881
Runzhen Xue;Mingyu Yan;Dengke Han;Ziheng Xiao;Zhimin Tang;Xiaochun Ye;Dongrui Fan
Heterogeneous graph neural networks (HGNNs) have expanded graph representation learning to heterogeneous graph fields. Recent studies have demonstrated their superior performance across various applications, including circuit representation, chip design automation, and placement optimization, often surpassing existing methods. However, GPUs often experience inefficiencies when executing HGNNs due to their unique and complex execution patterns. Compared to traditional graph neural networks (GNNs), these patterns further exacerbate irregularities in memory access. To tackle these challenges, recent studies have focused on developing domain-specific accelerators for HGNNs. Nonetheless, most of these efforts have concentrated on optimizing the datapath or scheduling data accesses, while largely overlooking the potential benefits that could be gained from leveraging the inherent properties of the semantic graph, such as its topology, layout, and generation. In this work, we focus on leveraging the properties of semantic graphs to enhance HGNN performance. First, we analyze the semantic graph build (SGB) stage and identify significant opportunities for data reuse during semantic graph generation. Next, we uncover the phenomenon of buffer thrashing during the graph feature processing (GFP) stage, revealing potential optimization opportunities in semantic graph layout. Furthermore, we propose a lightweight hardware accelerator frontend for HGNNs, called SiHGNN. This accelerator frontend incorporates a tree-based SGB for efficient semantic graph generation and features a novel Graph Restructurer for optimizing semantic graph layouts. Experimental results show that SiHGNN enables the state-of-the-art HGNN accelerator to achieve an average performance improvement of $2.95\times$.
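The data-reuse opportunity in the SGB stage comes from the fact that semantic graphs are built per metapath by chaining relation matrices, so metapaths sharing a prefix can share intermediate products. A toy numpy sketch of that reuse on a made-up author-paper-venue graph; the tree-based SGB itself is the paper's contribution and is not reproduced here:

```python
import numpy as np

# Toy heterogeneous graph: author-paper and paper-venue adjacency (0/1).
AP = np.array([[1, 0, 1],
               [0, 1, 0]])       # 2 authors x 3 papers
PV = np.array([[1, 0],
               [0, 1],
               [1, 0]])          # 3 papers x 2 venues

# Semantic graphs are built per metapath by chaining relations; entries
# count metapath instances between the endpoint vertices.
APA = AP @ AP.T                  # metapath author-paper-author
APV = AP @ PV                    # metapath author-paper-venue
# Reuse: metapath APVPA shares the APV prefix already computed above,
# so the partial product is reused instead of recomputing AP @ PV.
APVPA = APV @ APV.T
print(APA, APVPA, sep="\n")
```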
Citations: 0
Efficient Cartesian Genetic Programming-Based Automatic Synthesis Framework for Reversible Quantum-Flux-Parametron Logic Circuits
IF 2.9 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Vol. 44, no. 9, pp. 3369-3380 | Pub Date: 2025-02-27 | DOI: 10.1109/TCAD.2025.3546884
Rongliang Fu;Robert Wille;Nobuyuki Yoshikawa;Tsung-Yi Ho
Reversible computing has garnered significant attention as a promising avenue for achieving energy-efficient computing systems, particularly within the realm of quantum computing. The reversible quantum-flux-parametron (RQFP) is the first practical reversible logic gate utilizing adiabatic superconducting devices, with experimental evidence supporting both its logical and physical reversibility. Each RQFP logic gate operates on alternating current (AC) power and features three input ports and three output ports. Notably, each output port is capable of implementing a majority function while driving only a single fan-out. Additionally, the three inputs to each gate must arrive in the same clock phase. These inherent characteristics present substantial challenges in the design of RQFP logic circuits. To address these challenges, this article proposes an automatic synthesis framework for RQFP logic circuit design based on efficient Cartesian genetic programming (CGP). The framework aims to minimize both the number of RQFP logic gates and the number of garbage outputs within the generated RQFP logic circuit. It incorporates the specific characteristics of the RQFP logic circuit by encoding them into the genotype of a CGP individual. It also introduces several point mutation operations to facilitate the generation of new individuals. Furthermore, the framework integrates circuit simulation with formal verification to assess the functional equivalence between the parent and its offspring. Experimental results on RevLib and reversible reciprocal circuit benchmarks demonstrate the effectiveness of our framework.
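Cartesian genetic programming encodes a circuit as a fixed sequence of node genes, each holding a function id plus input connections restricted to earlier positions, and evolves it mainly through point mutation. The Python sketch below shows that genotype and a point-mutation operator in the abstract's three-input setting; the function set, sizes, and rates are hypothetical, and the real framework additionally encodes RQFP constraints (majority-style gates, single fan-out, clock-phase alignment) and checks functional equivalence via simulation plus formal verification:

```python
import random

N_IN, N_NODES, ARITY = 3, 6, 3  # toy sizes; RQFP gates have 3 in / 3 out
FUNCS = ["MAJ", "MAJ_NOT_A", "MAJ_NOT_B", "MAJ_NOT_C"]  # hypothetical set

def random_genotype():
    """Each node gene: (function id, input indices from earlier positions)."""
    genes = []
    for node in range(N_NODES):
        sources = list(range(N_IN + node))  # feed-forward connectivity only
        genes.append((random.randrange(len(FUNCS)),
                      [random.choice(sources) for _ in range(ARITY)]))
    return genes

def point_mutate(genes, rate=0.2):
    """Point mutation: redraw a node's function or one of its connections."""
    child = [(f, list(src)) for f, src in genes]
    for node, (f, src) in enumerate(child):
        if random.random() < rate:
            if random.random() < 0.5:
                child[node] = (random.randrange(len(FUNCS)), src)
            else:
                src[random.randrange(ARITY)] = random.randrange(N_IN + node)
    return child

parent = random_genotype()
offspring = [point_mutate(parent) for _ in range(4)]  # 1+lambda style step
print(parent[0], offspring[0][0])
```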
Citations: 0
Study of 3-D Line Edge Roughness (LER) in Vertical Channel Array Transistor for DRAM
IF 2.9 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Vol. 44, no. 9, pp. 3571-3580 | Pub Date: 2025-02-25 | DOI: 10.1109/TCAD.2025.3546195
Jaehyuk Lim;Seokchan Yoon;Juho Sung;Sanghyun Kang;Gwon Kim;Hyoung Won Baac;Changhwan Shin
Line edge roughness (LER) is an undesirable phenomenon that arises during semiconductor fabrication processes, causing fluctuations in the characteristics of semiconductor devices and potentially leading to significant yield degradation. Consequently, LER must be meticulously considered before fabricating integrated circuits. In this study, we present an approach for implementing and analyzing LER in vertical channel array transistors (VCATs) with a gate-all-around (GAA) structure for dynamic random access memory applications. Initially, we propose a method for reliably implementing LER in GAA semiconductor devices. Next, we extend the method to more complex structures beyond the basic cylindrical GAA structure. Utilizing the proposed method, we investigate the impact of LER on various VCAT device configurations by examining DC performance metrics such as $I_{OFF}$, $I_{DS,LIN}$, $I_{DS,SAT}$, $V_{T,LIN}$, $V_{T,SAT}$, $I_{OV,LIN}$, and $I_{OV,SAT}$. Additionally, we explore AC performance metrics ($T_{HOLD}$, $T_{READ}$, and $T_{WRITE}$) through mixed-mode simulations. The results show that the parameters influencing LER-induced fluctuations in VCATs vary depending on the transistor's operating region (i.e., whether the transistor is turned on or not).
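LER profiles for such simulations are typically synthesized as correlated Gaussian roughness with a target RMS amplitude and correlation length, e.g., via Fourier filtering of white noise. The numpy sketch below generates one such edge-displacement profile; the parameter values and the Gaussian autocorrelation choice are illustrative assumptions, since the abstract does not detail the paper's generation method:

```python
import numpy as np

def ler_profile(n: int, dz: float, sigma: float, corr_len: float,
                seed: int = 0) -> np.ndarray:
    """Correlated edge displacement along the channel via Fourier filtering.

    n: samples along the edge, dz: sample spacing (nm),
    sigma: target RMS roughness (nm), corr_len: correlation length (nm).
    """
    rng = np.random.default_rng(seed)
    freqs = np.fft.rfftfreq(n, d=dz)
    # Gaussian autocorrelation -> Gaussian power spectral density.
    psd = np.exp(-(np.pi * freqs * corr_len) ** 2)
    spectrum = np.sqrt(psd) * (rng.normal(size=freqs.size)
                               + 1j * rng.normal(size=freqs.size))
    edge = np.fft.irfft(spectrum, n=n)
    return sigma * edge / edge.std()  # rescale to the target RMS

profile = ler_profile(n=256, dz=0.5, sigma=0.4, corr_len=10.0)
print(f"RMS = {profile.std():.3f} nm, "
      f"peak-to-peak = {np.ptp(profile):.3f} nm")
```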
Citations: 0