
Latest publications from the 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)

A Circuit-Based SAT Solver for Logic Synthesis
Pub Date : 2021-11-01 DOI: 10.1109/ICCAD51958.2021.9643505
He-Teng Zhang, Jie-Hong R. Jiang, A. Mishchenko
In recent years, SAT solving has been widely used to implement various circuit transformations in logic synthesis. However, off-the-shelf CNF-based SAT solvers often have suboptimal performance on these challenging optimization problems. This paper describes an application-specific circuit-based SAT solver for logic synthesis. The solver is based on Glucose, a state-of-the-art CNF-based solver, and adds a number of novel features that make it run faster on multiple incremental SAT problems arising in redundancy removal and logic restructuring, among others. In particular, the circuit structure of the problem instance is leveraged in a new way to guide variable decisions and to converge to a solution faster for both satisfiable and unsatisfiable instances. Experimental results indicate that the proposed solver achieves a 2-4x speedup compared to the original Glucose.
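As a rough illustration of the abstract's key idea, letting circuit structure guide variable decisions, here is a toy DPLL solver that takes its branching order as an external parameter (for instance, a topological order of the circuit nodes). This is an illustrative sketch only; the paper's solver extends Glucose (a CDCL solver), and every name below is invented.

```python
# Toy DPLL solver whose branching order is supplied externally, sketching
# the idea of letting circuit structure guide variable decisions.
# Illustrative only: the paper's solver extends Glucose (CDCL), not DPLL.

def dpll(clauses, order, assignment=None):
    """clauses: list of DIMACS-style integer literal lists.
    order: variables in decision order, e.g. a topological order of the
    circuit nodes. Returns a satisfying assignment dict or None."""
    if assignment is None:
        assignment = {}
    changed = True
    while changed:                      # unit propagation to fixpoint
        changed = False
        for clause in clauses:
            unassigned, satisfied = [], False
            for lit in clause:
                val = assignment.get(abs(lit))
                if val is None:
                    unassigned.append(lit)
                elif (lit > 0) == val:
                    satisfied = True
                    break
            if satisfied:
                continue
            if not unassigned:          # all literals false: conflict
                return None
            if len(unassigned) == 1:    # unit clause forces a value
                lit = unassigned[0]
                assignment[abs(lit)] = lit > 0
                changed = True
    for var in order:                   # branch on first unassigned var
        if var not in assignment:
            for value in (True, False):
                result = dpll(clauses, order, {**assignment, var: value})
                if result is not None:
                    return result
            return None
    return assignment                   # every variable assigned: SAT

# Tiny instance: (x1 | x2) & (~x1 | x3) & (~x3 | ~x2)
cnf = [[1, 2], [-1, 3], [-3, -2]]
model = dpll(cnf, order=[1, 2, 3])
```

Passing a structure-derived `order` instead of an arbitrary one is the (heavily simplified) analogue of the paper's circuit-guided decision heuristic.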
Citations: 7
Hotspot Detection via Multi-task Learning and Transformer Encoder
Pub Date : 2021-11-01 DOI: 10.1109/ICCAD51958.2021.9643590
Binwu Zhu, Ran Chen, Xinyun Zhang, Fan Yang, Xuan Zeng, Bei Yu, Martin D. F. Wong
With the rapid development of semiconductors and the continuous scaling-down of circuit feature size, hotspot detection has become much more challenging and crucial as a critical step in the physical verification flow. In recent years, advanced deep learning techniques have spawned many frameworks for hotspot detection. However, most existing hotspot detectors can only detect defects arising in the central region of small clips, making the whole detection process time-consuming on large layouts. Some advanced hotspot detectors can detect multiple hotspots in a large area but need to propose potential defect regions, and a refinement step is required to locate the hotspots precisely. To simplify the procedure of multi-stage detectors, an end-to-end single-stage hotspot detector is proposed to identify hotspots on large scales without refining potential regions. In addition, multiple tasks are developed to learn various pattern topological features. Also, a feature aggregation module based on a Transformer Encoder is designed to globally capture the relationships between different features, further enhancing the feature representation ability. Experimental results show that our proposed framework achieves higher accuracy than prior methods with faster inference speed.
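As a hedged sketch of the global aggregation idea (not the paper's actual module), single-head self-attention over a small set of feature vectors can be written in plain NumPy; every shape and name below is illustrative.

```python
# Minimal single-head self-attention over a set of feature vectors,
# sketching how a Transformer-encoder-style module lets every feature
# attend to every other. Illustrative only; sizes and weights are invented.

import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (n_tokens, d) feature matrix; Wq/Wk/Wv: (d, d) projections."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(X.shape[1])          # scaled dot products
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # row-wise softmax
    return weights @ V   # each output row mixes information from all rows

rng = np.random.default_rng(1)
d = 16
X = rng.normal(size=(4, d))          # e.g. features from 4 learning tasks
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
Y = self_attention(X, Wq, Wk, Wv)    # same shape as X, globally aggregated
```

The key property is that the output for each feature vector is a data-dependent mixture of all of them, which is what "globally capture the relationships between different features" refers to.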
Citations: 7
An Optimal Algorithm for Splitter and Buffer Insertion in Adiabatic Quantum-Flux-Parametron Circuits
Pub Date : 2021-11-01 DOI: 10.1109/ICCAD51958.2021.9643456
Chao-Yuan Huang, Yi-Chen Chang, Ming-Jer Tsai, Tsung-Yi Ho
The Adiabatic Quantum-Flux-Parametron (AQFP), which benefits from low power consumption and rapid switching, is one of the emerging superconducting logic families. Due to the rapid switching, the delays of the inputs of an AQFP gate are strictly specified, so additional buffers are needed to synchronize them. Meanwhile, to maintain the symmetric layout of gates and reduce undesired parasitic magnetic coupling, the AQFP cell library adopts a minimalist design method in which splitters are employed for gates with multiple fan-outs. Thus, an AQFP circuit may demand numerous splitters and buffers, resulting in considerable power consumption and delay. This motivates an effective splitter and buffer insertion algorithm for AQFP circuits. In this paper, we propose a dynamic programming-based algorithm that provides an optimal splitter and buffer insertion for each wire of the input circuit. Experimental results show that our method is fast and reduces the number of additional Josephson Junctions (JJs) in complicated circuits by 10% compared with the state-of-the-art method.
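The phase-synchronization constraint that motivates buffer insertion can be illustrated with a minimal levelization sketch. This is a simple sketch under an assumed netlist format, not the paper's optimal dynamic program; all names are invented.

```python
# Toy path-balancing sketch: in AQFP every input of a gate must arrive in
# the same clock phase, so shorter paths are padded with buffers. This is
# plain levelization, not the paper's optimal DP; names are illustrative.

def balance_buffers(gates):
    """gates: dict mapping gate name -> list of fanin names (primary
    inputs have an empty fanin list). Returns (levels, buffer_count)."""
    levels = {}

    def level(g):
        if g not in levels:
            fanins = gates[g]
            levels[g] = 0 if not fanins else 1 + max(level(f) for f in fanins)
        return levels[g]

    for g in gates:
        level(g)
    # Each fanin edge u -> g needs (level(g) - 1 - level(u)) buffers so
    # that all inputs of g arrive in the same phase.
    buffers = sum(levels[g] - 1 - levels[u]
                  for g, fanins in gates.items() for u in fanins)
    return levels, buffers

# a, b, c are primary inputs; g1 = AND(a, b); g2 = AND(g1, c).
netlist = {'a': [], 'b': [], 'c': [], 'g1': ['a', 'b'], 'g2': ['g1', 'c']}
lv, nbuf = balance_buffers(netlist)
# g2 sits at level 2, so its direct input c (level 0) needs one buffer.
```

Since each buffer and splitter costs Josephson Junctions, minimizing their count (which the paper does optimally, jointly with splitter trees) directly reduces area and power.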
Citations: 8
Rerec: In-ReRAM Acceleration with Access-Aware Mapping for Personalized Recommendation
Pub Date : 2021-11-01 DOI: 10.1109/ICCAD51958.2021.9643573
Yitu Wang, Zhenhua Zhu, Fan Chen, Mingyuan Ma, Guohao Dai, Yu Wang, Hai Helen Li, Yiran Chen
Personalized recommendation systems are widely used in many Internet services. The sparse embedding lookup in recommendation models dominates the computational cost of inference due to its intensive irregular memory accesses. Applying resistive random access memory (ReRAM) based processing-in-memory (PIM) architectures to accelerate recommendation processing can avoid data movements caused by off-chip memory accesses. However, naïve adoption of ReRAM-based DNN accelerators leads to low computation parallelism and severe under-utilization of computing resources, caused by the fine-grained inner products in feature interaction. In this paper, we propose Rerec, an architecture-algorithm co-designed accelerator, which specializes fine-grained ReRAM-based inner-product engines with an access-aware mapping algorithm for recommendation inference. At the architecture level, we reduce the size and increase the number of crossbars. The crossbars are fully connected by Analog-to-Digital Converters (ADCs) in one inner-product engine, which can adapt to the fine-grained and irregular computational patterns and improve processing parallelism. We further explore trade-offs of (i) crossbar size vs. hardware utilization, and (ii) ADC implementation vs. area/energy efficiency to optimize the design. At the algorithm level, we propose a novel access-aware mapping (AAM) algorithm to optimize resource allocation. Our AAM algorithm tackles (i) the workload imbalance and (ii) the long recommendation inference latency induced by the great variance in access frequency of embedding vectors. Experimental results show that Rerec achieves 7.69x speedup compared with a ReRAM-based baseline design. Compared to CPU and the state-of-the-art recommendation accelerator, Rerec demonstrates 29.26x and 3.48x performance improvement, respectively.
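The crossbar inner-product primitive the abstract builds on can be sketched in a few lines. The mapping of a sparse embedding lookup onto row activations is illustrative, with all sizes invented; quantization and ADC effects are omitted.

```python
# Sketch of how a ReRAM crossbar computes inner products in the analog
# domain: voltages drive the rows, and by Ohm's and Kirchhoff's laws each
# column current sums v_i * g_ij. Illustrative only; ADC effects omitted.

import numpy as np

rng = np.random.default_rng(0)
num_rows, num_cols = 8, 4                        # one small crossbar tile
G = rng.uniform(0.0, 1.0, (num_rows, num_cols))  # programmed conductances
v = rng.uniform(0.0, 1.0, num_rows)              # input voltage vector

# Each column current is the inner product of v with that column.
column_currents = v @ G

# A sparse embedding lookup + sum-pooling reduces to the same primitive:
# activate only the rows holding the looked-up embedding vectors.
active = np.zeros(num_rows)
active[[1, 3, 6]] = 1.0          # indices of the accessed embedding rows
pooled = active @ G              # analog sum of embedding rows 1, 3, 6
```

Because only a few rows are active per lookup and the active set varies per query, large monolithic crossbars sit mostly idle, which is the under-utilization problem the paper's smaller, more numerous crossbars address.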
Citations: 6
Generating Architecture-Level Abstractions from RTL Designs for Processors and Accelerators Part I: Determining Architectural State Variables
Pub Date : 2021-11-01 DOI: 10.1109/ICCAD51958.2021.9643584
Yu Zeng, Bo-Yuan Huang, Hongce Zhang, Aarti Gupta, S. Malik
Today's Systems-on-Chips (SoCs) comprise general- and special-purpose programmable processors and specialized hardware modules referred to as accelerators. These accelerators serve as co-processors and are invoked through software or firmware. Thus, verifying SoCs requires co-verification of hardware with software/firmware. Co-verification using cycle-accurate hardware models is often not scalable and requires hardware abstractions. Among various abstractions, architecture-level abstractions are very effective as they retain only the software-visible state. An Instruction-Set Architecture (ISA) serves this role for processors, and such ISA-like abstractions are also desirable for accelerators. Manually creating such abstractions for accelerators is tedious and error-prone, and there is a growing need for automation in deriving them from existing Register-Transfer Level (RTL) implementations. An important part of this automation is determining which state variables to retain in the abstract model. For processors and accelerators, this set of variables is naturally the Architectural State Variables (ASVs): variables that are persistent across instructions. This paper presents the first work to automatically determine ASVs of processors and accelerators from their RTL implementations. We propose three novel algorithms based on different characteristics of ASVs. Each algorithm provides a sound abstraction, i.e., an over-approximate set of ASVs. The quality of the abstraction is measured by the size of the set of ASVs computed. Experiments on several processors and accelerators demonstrate that these algorithms perform best in different cases, and by combining them a high-quality set of ASVs can be found in reasonable time.
Citations: 4
Fast and Accurate PPA Modeling with Transfer Learning
Pub Date : 2021-11-01 DOI: 10.1109/ICCAD51958.2021.9643533
W. R. Davis, P. Franzon, Luis Francisco, Bill Huggins, Rajeev Jain
The power, performance and area (PPA) of digital blocks can vary 10:1 based on their synthesis, place, and route tool recipes. With the rapid increase in the number of PVT corners and the complexity of logic functions approaching 10M gates, industry has an acute need to minimize the human resources, compute servers, and EDA licenses needed to achieve a Pareto-optimal recipe. We first present models for fast, accurate PPA prediction that can reduce the manual optimization iterations with EDA tools. Second, we investigate techniques to automate the PPA optimization using evolutionary algorithms. For PPA prediction, a baseline model is trained on a known design using Latin hypercube sample runs of the EDA tool, and transfer learning is then used to train the model for an unseen design. For a known design the baseline needed 150 training runs to achieve 95% accuracy. With transfer learning, the same accuracy was achieved on a different (unseen) design in only 15 runs, indicating the viability of transfer learning for generalizing PPA models. The PPA optimization technique, based on evolutionary algorithms, effectively combines the PPA modeling and optimization. Our approach reached the same PPA solution as human designers in the same or fewer runs for a CORTEX-M0 system design. This shows potential for automating the recipe optimization without needing more runs than a human designer would need.
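The baseline-then-transfer workflow can be sketched on synthetic data. This is a minimal sketch assuming a simple polynomial relation between one recipe knob and a PPA metric; the real flow trains on actual EDA runs sampled via Latin hypercube, and every number below is invented.

```python
# Sketch of the baseline-then-transfer idea: fit a PPA-vs-recipe model on
# a known design with many samples, then adapt it to an unseen design
# with only a few samples by fitting a small correction. Synthetic data;
# the real flow uses Latin hypercube samples of actual EDA tool runs.

import numpy as np

rng = np.random.default_rng(42)

def features(x):
    # one recipe knob -> simple polynomial features (illustrative)
    return np.column_stack([np.ones(len(x)), x, x ** 2])

# "Known" design: plenty of training runs (150, as in the abstract).
x_base = rng.uniform(0, 1, 150)
y_base = 3.0 + 2.0 * x_base - 1.5 * x_base**2 + rng.normal(0, 0.01, 150)
w_base, *_ = np.linalg.lstsq(features(x_base), y_base, rcond=None)

# "Unseen" design: same trend, shifted constant; only 15 runs available.
x_new = rng.uniform(0, 1, 15)
y_new = 3.5 + 2.0 * x_new - 1.5 * x_new**2 + rng.normal(0, 0.01, 15)

# Transfer: keep the baseline model, fit only an offset correction.
residual = y_new - features(x_new) @ w_base
offset = residual.mean()
predict = lambda x: features(x) @ w_base + offset

x_test = rng.uniform(0, 1, 50)
y_test = 3.5 + 2.0 * x_test - 1.5 * x_test**2
rmse = np.sqrt(np.mean((predict(x_test) - y_test) ** 2))
```

The 15-sample fine-tune only has to learn the (small) design-to-design shift, not the whole recipe-response surface, which is why it needs an order of magnitude fewer runs than the baseline.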
Citations: 0
Algorithm and Hardware Co-design for Deep Learning-powered Channel Decoder: A Case Study
Pub Date : 2021-11-01 DOI: 10.1109/ICCAD51958.2021.9643510
Boyang Zhang, Yang Sui, Lingyi Huang, Siyu Liao, Chunhua Deng, Bo Yuan
The channel decoder is a key component in many communication systems. Recently, neural network-based channel decoders have been actively investigated because of the great potential of their data-driven decoding procedure. However, lying at the intersection of machine learning, information theory, and hardware design, the efficient algorithm and hardware co-design of deep learning-powered channel decoders has not been well studied. This paper is a first step towards exploring efficient DNN-enabled channel decoders from a joint algorithm and hardware perspective. We first revisit our recently proposed doubly residual neural (DRN) decoder. By introducing an advanced architectural topology into the decoder design, the overall error-correcting performance can be significantly improved. Based on this algorithm, we further develop the corresponding systolic array-based hardware architecture for the DRN decoder. The corresponding FPGA implementation of our DRN decoder on a short LDPC code is also developed.
Citations: 2
AutoMap: Automated Mapping of Security Properties Between Different Levels of Abstraction in Design Flow
Pub Date : 2021-11-01 DOI: 10.1109/ICCAD51958.2021.9643467
Bulbul Ahmed, Fahim Rahman, Nick Hooten, Farimah Farahmandi, M. Tehranipoor
The security of system-on-chip (SoC) designs is threatened by many vulnerabilities introduced by untrusted third-party IPs, and by designers' and CAD tools' lack of awareness of security requirements. Ensuring the security of an SoC has become highly challenging due to diverse threat models, high design complexity, and the lack of effective security-aware verification solutions. Moreover, new security vulnerabilities are introduced during the design transformation from higher to lower abstraction levels. As a result, security verification becomes a major bottleneck that should be performed at every level of design abstraction. Reducing the verification effort by mapping security properties across design stages can lower the total verification time, provided the new vulnerabilities introduced at different abstraction levels are addressed properly. To address this challenge, we introduce AutoMap, which, in addition to the mapping, extends and expands the security properties to identify new vulnerabilities introduced when the design moves from higher- to lower-level abstraction. Starting at the higher abstraction level with a defined set of security properties for the target threat models, AutoMap automatically maps the properties to the lower levels of abstraction to reduce the verification effort. Furthermore, it extends and expands the properties to cover new vulnerabilities introduced by design transformations and updates at the lower abstraction level. We demonstrate AutoMap's efficacy by applying it to AES, RSA, and SHA256 at the C++, RTL, and gate levels. We show that AutoMap effectively facilitates the detection of security vulnerabilities from different sources during the design transformation.
{"title":"AutoMap: Automated Mapping of Security Properties Between Different Levels of Abstraction in Design Flow","authors":"Bulbul Ahmed, Fahim Rahman, Nick Hooten, Farimah Farahmandi, M. Tehranipoor","doi":"10.1109/ICCAD51958.2021.9643467","DOIUrl":"https://doi.org/10.1109/ICCAD51958.2021.9643467","url":null,"abstract":"The security of system-on-chip (SoC) designs is threatened by many vulnerabilities introduced by untrusted third-party IPs, and designers and CAD tools' lack of awareness of security requirements. Ensuring the security of an SoC has become highly challenging due to the diverse threat models, high design complexity, and lack of effective security-aware verification solutions. Moreover, new security vulnerabilities are introduced during the design transformation from higher to lower abstraction levels. As a result, security verification becomes a major bottleneck that should be performed at every level of design abstraction. Reducing the verification effort by mapping the security properties at different design stages could be an efficient solution to lower the total verification time if the new vulnerabilities introduced at different abstraction levels are addressed properly. To address this challenge, we introduce AutoMap that, in addition to the mapping, extends and expands the security properties to identify new vulnerabilities introduced when the design moves from higher-to lower-level abstraction. Starting at the higher abstraction level with a defined set of security properties for the target threat models, AutoMap automatically maps the properties to the lower levels of abstraction to reduce the verification effort. Furthermore, it extends and expands the properties to cover new vulnerabilities introduced by design transformations and updates to the lower abstraction level. We demonstrate AutoMap's efficacy by applying it to AES, RSA, and SHA256 at C++, RTL, and gate-level. 
We show that AutoMap effectively facilitates the detection of security vulnerabilities from different sources during the design transformation.","PeriodicalId":370791,"journal":{"name":"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129594154","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
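The core idea of the abstract — carrying a property downward and expanding it when one high-level signal becomes several low-level nets — can be sketched minimally as follows. The class, function, and signal names here are illustrative assumptions for exposition, not AutoMap's actual API or property language:

```python
from dataclasses import dataclass

@dataclass
class SecurityProperty:
    signal: str      # signal the property constrains
    condition: str   # e.g. "never leaks to primary output"

def map_property(prop, name_map):
    """Translate a property to a lower abstraction level by renaming its
    signal via the synthesis name map; if one high-level signal splits
    into several low-level nets, expand into one property per net."""
    nets = name_map.get(prop.signal, [prop.signal])
    return [SecurityProperty(n, prop.condition) for n in nets]

# Hypothetical example: RTL register 'key_reg' becomes two gate-level nets.
name_map = {"key_reg": ["key_reg_q[0]", "key_reg_q[1]"]}
rtl_prop = SecurityProperty("key_reg", "never leaks to primary output")
gate_props = map_property(rtl_prop, name_map)
print([p.signal for p in gate_props])  # → ['key_reg_q[0]', 'key_reg_q[1]']
```

One RTL-level property thus becomes two gate-level proof obligations, which is the sense in which mapping alone reduces but does not eliminate the lower-level verification effort.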
Citations: 3
A Framework for Area-efficient Multi-task BERT Execution on ReRAM-based Accelerators
Pub Date : 2021-11-01 DOI: 10.1109/ICCAD51958.2021.9643471
Myeonggu Kang, Hyein Shin, Jaekang Shin, L. Kim
With its superior algorithmic performance, BERT has become the de facto standard model for various NLP tasks. Accordingly, multiple BERT models are often deployed on a single system, a setting also called multi-task BERT. Although a ReRAM-based accelerator shows sufficient potential to execute a single BERT model by adopting in-memory computation, processing multi-task BERT on such an accelerator drastically increases the overall area due to the multiple fine-tuned models. In this paper, we propose a framework for area-efficient multi-task BERT execution on a ReRAM-based accelerator. First, we decompose the fine-tuned model of each task by utilizing the base model. We then propose a two-stage weight compressor, which shrinks the decomposed models by analyzing the properties of the ReRAM-based accelerator. We also present a profiler to generate hyper-parameters for the proposed compressor. By sharing the base model and compressing the decomposed models, the proposed framework reduces the total area of the ReRAM-based accelerator without an additional training procedure, achieving 0.26× the area of the baseline while maintaining algorithmic performance.
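The decomposition-plus-compression flow described above can be sketched numerically: store the base weights once, keep only a compressed task-specific delta per model. The two stages shown (magnitude pruning, then uniform quantization) and all parameter values are illustrative assumptions, not the paper's actual compressor or profiler:

```python
import numpy as np

def decompose(task_w, base_w):
    """Express a fine-tuned weight matrix as base + task-specific delta,
    so one shared base model can serve every task."""
    return task_w - base_w

def compress_delta(delta, keep_ratio=0.1, levels=16):
    """Two-stage compression of a task delta (illustrative stages):
    (1) magnitude pruning keeps only the largest-magnitude entries;
    (2) uniform quantization snaps the survivors onto a few levels,
    as one might target a small set of ReRAM cell conductances."""
    flat = np.abs(delta).ravel()
    k = max(1, int(keep_ratio * flat.size))
    thresh = np.partition(flat, -k)[-k]              # k-th largest magnitude
    pruned = np.where(np.abs(delta) >= thresh, delta, 0.0)
    scale = np.abs(pruned).max() / (levels // 2)
    if scale == 0:
        scale = 1.0
    return np.round(pruned / scale) * scale

rng = np.random.default_rng(0)
base = rng.normal(size=(4, 4))                   # shared base-model weights
task = base + 0.01 * rng.normal(size=(4, 4))     # one task's fine-tuned weights
delta = compress_delta(decompose(task, base))
print(np.count_nonzero(delta), "of", delta.size, "delta entries stored")
```

Because fine-tuning perturbs the base weights only slightly, the delta is small and compresses far better than the full fine-tuned matrix, which is the source of the area saving.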
Citations: 1
HeteroCPPR: Accelerating Common Path Pessimism Removal with Heterogeneous CPU-GPU Parallelism
Pub Date : 2021-11-01 DOI: 10.1109/ICCAD51958.2021.9643457
Zizheng Guo, Tsung-Wei Huang, Yibo Lin
Common path pessimism removal (CPPR) is a key step in eliminating unwanted pessimism during static timing analysis (STA). Unwanted pessimism forces designers and optimization algorithms to waste significant yet unnecessary effort fixing paths that already meet the intended timing constraints. However, CPPR is extremely time-consuming and can incur 10–100× runtime overhead to complete. Existing solutions for speeding up CPPR are architecturally constrained to CPU-only parallelism, and their runtimes do not scale beyond 8–16 cores. In this paper, we introduce HeteroCPPR, a new algorithm that accelerates CPPR by harnessing heterogeneous CPU-GPU parallelism. We devise an efficient CPU-GPU task decomposition strategy and highly optimized GPU kernels to handle CPPR across large numbers of paths. HeteroCPPR also scales to multiple GPUs. As an example, HeteroCPPR is up to 16× faster than a state-of-the-art CPU-parallel CPPR algorithm when completing the analysis of 10K post-CPPR critical paths in a million-gate design on a machine with 40 CPUs and 4 GPUs.
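The pessimism that CPPR removes arises on clock-network segments shared by the launch and capture paths of a flip-flop pair. A minimal sketch of the per-pair credit computation follows; this is a textbook formulation of the credit under assumed min/max cell delays, not HeteroCPPR's CPU-GPU decomposition or kernel design:

```python
def cppr_credit(launch_path, capture_path, delays):
    """Pessimism credit for one launch/capture flip-flop pair: along the
    clock-tree prefix shared by both paths, the same physical cell cannot
    simultaneously sit at its max-delay corner for capture and its
    min-delay corner for launch, so the accumulated (max - min) spread on
    the shared segment is credited back to the slack."""
    credit = 0.0
    for a, b in zip(launch_path, capture_path):
        if a != b:            # paths diverge; pessimism beyond here is real
            break
        dmin, dmax = delays[a]
        credit += dmax - dmin
    return credit

# Hypothetical clock tree: root and buf1 are shared; paths split at buf2/buf3.
delays = {"clk_root": (1.0, 1.2), "buf1": (0.5, 0.7),
          "buf2": (0.3, 0.4), "buf3": (0.2, 0.25)}
launch = ["clk_root", "buf1", "buf2"]
capture = ["clk_root", "buf1", "buf3"]
print(round(cppr_credit(launch, capture, delays), 6))  # → 0.4
```

The cost the paper targets comes from repeating this per launch/capture pair over very many paths, which is why CPPR dominates STA runtime and is a natural fit for massive GPU parallelism.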
Citations: 10
2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)