Online and Offline Machine Learning for Industrial Design Flow Tuning: (Invited - ICCAD Special Session Paper)
Pub Date: 2021-11-01, DOI: 10.1109/ICCAD51958.2021.9643577
M. Ziegler, Jihye Kwon, Hung-Yi Liu, L. Carloni
Modern logic and physical synthesis tools provide numerous options and parameters that can drastically affect design quality; however, the large number of options leads to a complex design space that is difficult for human designers to navigate. Fortunately, machine learning approaches and cloud computing environments are well suited to tackling complex parameter-tuning problems like those seen in VLSI design flows. This paper proposes a holistic approach in which online and offline machine learning approaches work together to tune industrial design flows. We describe a system called SynTunSys (STS) that has been used to optimize multiple industrial high-performance processors. STS consists of an online system that optimizes designs and generates data for a recommender system that performs offline training and recommendation. Experimental results show that the collaboration between the STS online and offline machine learning systems, together with insight from human designers, provides best-of-breed results. Finally, we discuss potential new directions for research on design flow tuning.
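As a rough illustration of the online tuning loop described above, the sketch below runs a generation-based search: primitive parameter scenarios are evaluated first, and the best survivors are combined in later generations. The knob names, the pairwise combination rule, and the stand-in cost function are all assumptions made for illustration; they are not SynTunSys's actual interface.

```python
import random

# Primitive knob settings ("scenarios"); the knob names are invented.
PRIMITIVES = [
    {"logic_effort": "high"},
    {"restructure": True},
    {"useful_skew": True},
    {"vt_mix": 0.3},
]

def run_flow(scenario):
    """Stand-in for a full synthesis run returning a scalar cost
    (e.g., a weighted sum of timing, power, and congestion)."""
    random.seed(repr(sorted(scenario.items())))
    return random.random()

def tune(primitives, generations=3, survivors=2):
    population = [dict(p) for p in primitives]
    best = min(population, key=run_flow)
    for _ in range(generations):
        # Keep the best scenarios of this generation...
        ranked = sorted(population, key=run_flow)[:survivors]
        # ...and combine them pairwise to form the next generation.
        population = [dict(a, **b) for i, a in enumerate(ranked)
                      for b in ranked[i + 1:]]
        if not population:
            break
        best = min(population + [best], key=run_flow)
    return best, run_flow(best)

print(tune(PRIMITIVES))
```

An offline recommender, in this framing, would be trained on the archive of (scenario, cost) pairs that runs like this one produce, and would seed the initial population for new macros.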
{"title":"Online and Offline Machine Learning for Industrial Design Flow Tuning: (Invited - ICCAD Special Session Paper)","authors":"M. Ziegler, Jihye Kwon, Hung-Yi Liu, L. Carloni","doi":"10.1109/ICCAD51958.2021.9643577","DOIUrl":"https://doi.org/10.1109/ICCAD51958.2021.9643577","url":null,"abstract":"Modern logic and physical synthesis tools provide numerous options and parameters that can drastically affect design quality; however, the large number of options leads to a complex design space difficult for human designers to navigate. Fortunately, machine learning approaches and cloud computing environments are well suited for tackling complex parameter tuning problems like those seen in VLSI design flows. This paper proposes a holistic approach where online and offline machine learning approaches work together for tuning industrial design flows. We describe a system called SynTunSys (STS) that has been used to optimize multiple industrial high-performance processors. STS consists of an online system that optimizes designs and generates data for a recommender system that performs offline training and recommendation. Experimental results show the collaboration between STS online and offline machine learning systems as well as insight from human designers provide best-of-breed results. Finally, we discuss potential new directions for research on design flow tuning.","PeriodicalId":370791,"journal":{"name":"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125358481","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
AutoMap: Automated Mapping of Security Properties Between Different Levels of Abstraction in Design Flow
Pub Date: 2021-11-01, DOI: 10.1109/ICCAD51958.2021.9643467
Bulbul Ahmed, Fahim Rahman, Nick Hooten, Farimah Farahmandi, M. Tehranipoor
The security of system-on-chip (SoC) designs is threatened by many vulnerabilities introduced by untrusted third-party IPs and by the lack of security awareness among designers and CAD tools. Ensuring the security of an SoC has become highly challenging due to diverse threat models, high design complexity, and a lack of effective security-aware verification solutions. Moreover, new security vulnerabilities are introduced during the design transformation from higher to lower abstraction levels. As a result, security verification becomes a major bottleneck that must be performed at every level of design abstraction. Reducing the verification effort by mapping the security properties across design stages could efficiently lower the total verification time, provided the new vulnerabilities introduced at different abstraction levels are addressed properly. To address this challenge, we introduce AutoMap, which, in addition to the mapping, extends and expands the security properties to identify new vulnerabilities introduced as the design moves from higher to lower levels of abstraction. Starting at the higher abstraction level with a defined set of security properties for the target threat models, AutoMap automatically maps the properties to the lower levels of abstraction to reduce the verification effort. Furthermore, it extends and expands the properties to cover new vulnerabilities introduced by design transformations and updates at the lower abstraction level. We demonstrate AutoMap's efficacy by applying it to AES, RSA, and SHA256 at the C++, RTL, and gate levels. We show that AutoMap effectively facilitates the detection of security vulnerabilities from different sources during design transformation.
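A toy illustration of the mapping-and-expansion idea: a higher-level property is rewritten in terms of lower-level signal names, and expanded to also cover registers introduced by the design transformation. The name map and property format below are invented for this sketch; the paper's actual machinery is far richer.

```python
# Hypothetical correspondence between C++-level variables and RTL signals,
# as might be extracted from synthesis reports (invented names).
HLS_TO_RTL = {
    "key":   ["key_reg"],
    "state": ["round_state_q", "round_state_shadow_q"],  # copy added by retiming
}

def map_property(prop, name_map):
    """prop: (property_kind, high_level_signal). Returns one lower-level
    property per mapped signal, so transformation-added copies (e.g.,
    shadow or pipeline registers) are checked as well."""
    kind, signal = prop
    return [(kind, rtl_sig) for rtl_sig in name_map.get(signal, [])]

print(map_property(("no_leak", "state"), HLS_TO_RTL))
# [('no_leak', 'round_state_q'), ('no_leak', 'round_state_shadow_q')]
```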
{"title":"AutoMap: Automated Mapping of Security Properties Between Different Levels of Abstraction in Design Flow","authors":"Bulbul Ahmed, Fahim Rahman, Nick Hooten, Farimah Farahmandi, M. Tehranipoor","doi":"10.1109/ICCAD51958.2021.9643467","DOIUrl":"https://doi.org/10.1109/ICCAD51958.2021.9643467","url":null,"abstract":"The security of system-on-chip (SoC) designs is threatened by many vulnerabilities introduced by untrusted third-party IPs, and designers and CAD tools' lack of awareness of security requirements. Ensuring the security of an SoC has become highly challenging due to the diverse threat models, high design complexity, and lack of effective security-aware verification solutions. Moreover, new security vulnerabilities are introduced during the design transformation from higher to lower abstraction levels. As a result, security verification becomes a major bottleneck that should be performed at every level of design abstraction. Reducing the verification effort by mapping the security properties at different design stages could be an efficient solution to lower the total verification time if the new vulnerabilities introduced at different abstraction levels are addressed properly. To address this challenge, we introduce AutoMap that, in addition to the mapping, extends and expands the security properties to identify new vulnerabilities introduced when the design moves from higher-to lower-level abstraction. Starting at the higher abstraction level with a defined set of security properties for the target threat models, AutoMap automatically maps the properties to the lower levels of abstraction to reduce the verification effort. Furthermore, it extends and expands the properties to cover new vulnerabilities introduced by design transformations and updates to the lower abstraction level. We demonstrate AutoMap's efficacy by applying it to AES, RSA, and SHA256 at C++, RTL, and gate-level. We show that AutoMap effectively facilitates the detection of security vulnerabilities from different sources during the design transformation.","PeriodicalId":370791,"journal":{"name":"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129594154","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An Optimal Algorithm for Splitter and Buffer Insertion in Adiabatic Quantum-Flux-Parametron Circuits
Pub Date: 2021-11-01, DOI: 10.1109/ICCAD51958.2021.9643456
Chao-Yuan Huang, Yi-Chen Chang, Ming-Jer Tsai, Tsung-Yi Ho
The Adiabatic Quantum-Flux-Parametron (AQFP), which benefits from low power consumption and rapid switching, is an emerging superconducting logic family. Because of the rapid switching, the arrival times at the inputs of an AQFP gate are strictly specified, so additional buffers are needed to synchronize them. Meanwhile, to maintain the symmetric layout of gates and reduce undesired parasitic magnetic coupling, the AQFP cell library adopts a minimalist design method in which splitters are employed for gates with multiple fan-outs. Thus, an AQFP circuit may demand numerous splitters and buffers, resulting in considerable power consumption and delay. This motivates an effective splitter and buffer insertion algorithm for AQFP circuits. In this paper, we propose a dynamic programming-based algorithm that provides an optimal splitter and buffer insertion for each wire of the input circuit. Experimental results show that our method is fast and reduces the number of additional Josephson junctions (JJs) by 10% on complex circuits compared with the state-of-the-art method.
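The paper's dynamic program is not reproduced here, but the sketch below conveys the cost structure it optimizes, under simplifying assumptions of my own: fanout-2 splitters, one clock phase per cell, and sinks at known phase depths relative to the driver. It sweeps from the deepest phase upward and counts the buffer/splitter cells a single multi-fanout net requires.

```python
import math

def tree_cells(sink_depths):
    """Count buffer/splitter cells for one net, assuming each cell occupies
    one phase, a buffer drives 1 wire, and a splitter drives 2 (illustrative
    assumptions, not the paper's cell library)."""
    sinks_at = {}
    for d in sink_depths:
        sinks_at[d] = sinks_at.get(d, 0) + 1
    cells, wires_below = 0, 0
    for depth in range(max(sink_depths), 0, -1):
        inserted = math.ceil(wires_below / 2)  # cells placed at this phase
        cells += inserted
        # Wires that must arrive at this phase: sinks here plus the
        # inputs of the cells just inserted.
        wires_below = sinks_at.get(depth, 0) + inserted
    assert wires_below == 1, "driver (fanout 1) must feed exactly one wire"
    return cells

# Two sinks three phases away and one sink two phases away:
print(tree_cells([3, 3, 2]))  # 2: one splitter at phase 2, one at phase 1
```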
{"title":"An Optimal Algorithm for Splitter and Buffer Insertion in Adiabatic Quantum-Flux-Parametron Circuits","authors":"Chao-Yuan Huang, Yi-Chen Chang, Ming-Jer Tsai, Tsung-Yi Ho","doi":"10.1109/ICCAD51958.2021.9643456","DOIUrl":"https://doi.org/10.1109/ICCAD51958.2021.9643456","url":null,"abstract":"The Adiabatic Quantum-Flux-Parametron (AQFP), which benefits from low power consumption and rapid switching, is one of the rising superconducting logics. Due to the rapid switching, the delay of the inputs of an AQFP gate is strictly specified so that additional buffers are needed to synchronize the delay. Meanwhile, to maintain the symmetry layout of gates and reduce the undesired parasitic magnetic coupling, the AQFP cell library adopts the minimalist design method in which splitters are employed for the gates with multiple fan-outs. Thus, an AQFP circuit may demand numerous splitters and buffers, resulting in a considerable amount of power consumption and delay. This provides a motivation for proposing an effective splitter and buffer insertion algorithm for the AQFP circuits. In this paper, we propose a dynamic programming-based algorithm that provides an optimal splitter and buffer insertion for each wire of the input circuit. Experimental results show that our method is fast, and has a 10% reduction of additional Josephson Junctions (JJs) in the complicated circuits compared with the state-of-the-art method.","PeriodicalId":370791,"journal":{"name":"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128569035","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Circuit-Based SAT Solver for Logic Synthesis
Pub Date: 2021-11-01, DOI: 10.1109/ICCAD51958.2021.9643505
He-Teng Zhang, Jie-Hong R. Jiang, A. Mishchenko
In recent years, SAT solving has been widely used to implement various circuit transformations in logic synthesis. However, off-the-shelf CNF-based SAT solvers often have suboptimal performance on these challenging optimization problems. This paper describes an application-specific circuit-based SAT solver for logic synthesis. The solver is based on Glucose, a state-of-the-art CNF-based solver, and adds a number of novel features that make it run faster on the multiple incremental SAT problems arising in redundancy removal and logic restructuring, among others. In particular, the circuit structure of the problem instance is leveraged in a new way to guide variable decisions and to converge to a solution faster for both satisfiable and unsatisfiable instances. Experimental results indicate that the proposed solver leads to a 2-4x speedup compared to the original Glucose.
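To make the circuit-based idea concrete, here is a deliberately naive justification-style search on an AND/NOT graph: required values propagate backward through gates, and branching happens only where the circuit offers a choice (an AND gate justified to 0). This toy is complete only for tree-shaped circuits; the paper's solver instead builds circuit-guided decisions, conflict-driven learning, and incremental interfaces on top of Glucose.

```python
# Gates: node -> ('AND', a, b) | ('NOT', a) | ('VAR',); a VAR is named
# by its own key. Returns a satisfying assignment dict, or None.
def justify(node, want, assign, graph):
    kind = graph[node][0]
    if kind == 'VAR':
        if node in assign:
            return assign if assign[node] == want else None
        return dict(assign, **{node: want})
    if kind == 'NOT':
        return justify(graph[node][1], not want, assign, graph)
    a, b = graph[node][1], graph[node][2]
    if want:  # AND = 1: both inputs are forced to 1 (no decision)
        res = justify(a, True, assign, graph)
        return justify(b, True, res, graph) if res is not None else None
    for side in (a, b):  # AND = 0: decide which input to set to 0
        res = justify(side, False, assign, graph)
        if res is not None:
            return res
    return None

# Is (x AND NOT y) satisfiable with output 1?
g = {'x': ('VAR',), 'y': ('VAR',),
     'ny': ('NOT', 'y'), 'out': ('AND', 'x', 'ny')}
print(justify('out', True, {}, g))  # {'x': True, 'y': False}
```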
{"title":"A Circuit-Based SAT Solver for Logic Synthesis","authors":"He-Teng Zhang, Jie-Hong R. Jiang, A. Mishchenko","doi":"10.1109/ICCAD51958.2021.9643505","DOIUrl":"https://doi.org/10.1109/ICCAD51958.2021.9643505","url":null,"abstract":"In recent years SAT solving has been widely used to implement various circuit transformations in logic synthesis. However, off-the-shelf CNF-based SAT solvers often have suboptimal performance on these challenging optimization problems. This paper describes an application-specific circuit-based SAT solver for logic synthesis. The solver is based on Glucose, a state-of-the-art CNF-based solver and adds a number of novel features, which make it run faster on multiple incremental SAT problems arising in redundancy removal and logic restructuring among others. In particular, the circuit structure of the problem instance is leveraged in a new way to guide variable decisions and to converge to a solution faster for both satisfiable and unsatisfiable instances. Experimental results indicate that the proposed solver leads to a 2-4x speedup, compared to the original Glucose.","PeriodicalId":370791,"journal":{"name":"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124566525","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Generating Architecture-Level Abstractions from RTL Designs for Processors and Accelerators Part I: Determining Architectural State Variables
Pub Date: 2021-11-01, DOI: 10.1109/ICCAD51958.2021.9643584
Yu Zeng, Bo-Yuan Huang, Hongce Zhang, Aarti Gupta, S. Malik
Today's Systems-on-Chips (SoCs) comprise general- and special-purpose programmable processors and specialized hardware modules referred to as accelerators. These accelerators serve as co-processors and are invoked through software or firmware. Thus, verifying SoCs requires co-verification of hardware with software/firmware. Co-verification using cycle-accurate hardware models is often not scalable and requires hardware abstractions. Among various abstractions, architecture-level abstractions are very effective as they retain only the software-visible state. An Instruction-Set Architecture (ISA) serves this role for processors, and such ISA-like abstractions are also desirable for accelerators. Manually creating such abstractions for accelerators is tedious and error-prone, and there is a growing need for automation in deriving them from existing Register-Transfer Level (RTL) implementations. An important part of this automation is determining which state variables to retain in the abstract model. For processors and accelerators, this set of variables is naturally the Architectural State Variables (ASVs): variables that are persistent across instructions. This paper presents the first work to automatically determine the ASVs of processors and accelerators from their RTL implementations. We propose three novel algorithms based on different characteristics of ASVs. Each algorithm provides a sound abstraction, i.e., an over-approximate set of ASVs. The quality of the abstraction is measured by the size of the computed set of ASVs. Experiments on several processors and accelerators demonstrate that the algorithms each perform best in different cases, and that by combining them a high-quality set of ASVs can be found in reasonable time.
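A toy version of one plausible ASV criterion suggested by the paper's framing: a register whose next-state function can hold its current value across an instruction boundary is a candidate architectural state variable, whereas purely transient pipeline state is overwritten every cycle. The next-state table below is invented for illustration and is not the paper's actual analysis.

```python
# reg -> set of signals its next value can come from; containing itself
# means the register has a "hold" path (all names invented).
NEXT_STATE = {
    "pc":       {"pc", "branch_target"},
    "regfile":  {"regfile", "alu_out"},
    "if_id_ir": {"imem_out"},              # pipeline latch: always rewritten
    "alu_tmp":  {"operand_a", "operand_b"},
}

def candidate_asvs(next_state):
    """Over-approximate ASV set: keep registers that can retain their own
    value, i.e., persist when not architecturally updated."""
    return {r for r, sources in next_state.items() if r in sources}

print(sorted(candidate_asvs(NEXT_STATE)))  # ['pc', 'regfile']
```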
{"title":"Generating Architecture-Level Abstractions from RTL Designs for Processors and Accelerators Part I: Determining Architectural State Variables","authors":"Yu Zeng, Bo-Yuan Huang, Hongce Zhang, Aarti Gupta, S. Malik","doi":"10.1109/ICCAD51958.2021.9643584","DOIUrl":"https://doi.org/10.1109/ICCAD51958.2021.9643584","url":null,"abstract":"Today's Systems-on-Chips (SoCs) comprise general/special purpose programmable processors and specialized hardware modules referred to as accelerators. These accelerators serve as co-processors and are invoked through software or firmware. Thus, verifying SoCs requires co-verification of hardware with software/firmware. Co-verification using cycle-accurate hardware models is often not scalable, and requires hardware abstractions. Among various abstractions, architecture-level abstractions are very effective as they retain only the software visible state. An Instruction-Set Architecture (ISA) serves this role for processors and such ISA-like abstractions are also desirable for accelerators. Manually creating such abstractions for accelerators is tedious and error-prone, and there is a growing need for automation in deriving them from existing Register-Transfer Level (RTL) implementations. An important part of this automation is determining which state variables to retain in the abstract model. For processors and accelerators, this set of variables is naturally the Architectural State Variables (ASVs) - variables that are persistent across instructions. This paper presents the first work to automatically determine ASVs of processors and accelerators from their RTL implementations. We propose three novel algorithms based on different characteristics of ASVs. Each algorithm provides a sound abstraction, i.e., an over-approximate set of ASVs. The quality of the abstraction is measured by the size of the set of ASVs computed. Experiments on several processors and accelerators demonstrate that these algorithms perform best in different cases, and by combining them a high quality set of ASVs can be found in reasonable time.","PeriodicalId":370791,"journal":{"name":"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"88 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127134926","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rerec: In-ReRAM Acceleration with Access-Aware Mapping for Personalized Recommendation
Pub Date: 2021-11-01, DOI: 10.1109/ICCAD51958.2021.9643573
Yitu Wang, Zhenhua Zhu, Fan Chen, Mingyuan Ma, Guohao Dai, Yu Wang, Hai Helen Li, Yiran Chen
Personalized recommendation systems are widely used in many Internet services. The sparse embedding lookup in recommendation models dominates the computational cost of inference due to its intensive irregular memory accesses. Applying a resistive random access memory (ReRAM) based processing-in-memory (PIM) architecture to accelerate recommendation processing can avoid the data movement caused by off-chip memory accesses. However, naïve adoption of ReRAM-based DNN accelerators leads to low computation parallelism and severe under-utilization of computing resources, caused by the fine-grained inner-products in feature interaction. In this paper, we propose Rerec, an architecture-algorithm co-designed accelerator that specializes in fine-grained ReRAM-based inner-product engines with an access-aware mapping algorithm for recommendation inference. At the architecture level, we reduce the size and increase the number of crossbars. The crossbars are fully connected by Analog-to-Digital Converters (ADCs) within one inner-product engine, which can adapt to the fine-grained and irregular computational patterns and improve processing parallelism. We further explore trade-offs of (i) crossbar size vs. hardware utilization and (ii) ADC implementation vs. area/energy efficiency to optimize the design. At the algorithm level, we propose a novel access-aware mapping (AAM) algorithm to optimize resource allocation. Our AAM algorithm tackles (i) the workload imbalance and (ii) the long recommendation inference latency induced by the large variance in the access frequency of embedding vectors. Experimental results show that Rerec achieves a 7.69x speedup compared with a ReRAM-based baseline design. Compared to a CPU and the state-of-the-art recommendation accelerator, Rerec demonstrates 29.26x and 3.48x performance improvements, respectively.
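A minimal sketch of the access-aware flavor of such a mapping: spread hot embedding vectors across inner-product engines so no single engine serializes most lookups. This is a plain greedy longest-processing-time assignment under assumptions of my own (engine count, frequency table); the paper's AAM algorithm also accounts for crossbar geometry and other constraints not modeled here.

```python
import heapq

def access_aware_map(freqs, n_engines):
    """freqs: {vector_id: expected lookups}. Returns engine -> vector list,
    assigning the hottest remaining vector to the least-loaded engine."""
    heap = [(0, e, []) for e in range(n_engines)]  # (load, engine, vectors)
    heapq.heapify(heap)
    for vec, f in sorted(freqs.items(), key=lambda kv: -kv[1]):
        load, e, vecs = heapq.heappop(heap)
        vecs.append(vec)
        heapq.heappush(heap, (load + f, e, vecs))
    return {e: vecs for _, e, vecs in heap}

# Invented access profile: two hot vectors, four cold ones, two engines.
freqs = {"v0": 900, "v1": 850, "v2": 40, "v3": 30, "v4": 20, "v5": 10}
print(access_aware_map(freqs, 2))  # hot vectors land on different engines
```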
{"title":"Rerec: In-ReRAM Acceleration with Access-Aware Mapping for Personalized Recommendation","authors":"Yitu Wang, Zhenhua Zhu, Fan Chen, Mingyuan Ma, Guohao Dai, Yu Wang, Hai Helen Li, Yiran Chen","doi":"10.1109/ICCAD51958.2021.9643573","DOIUrl":"https://doi.org/10.1109/ICCAD51958.2021.9643573","url":null,"abstract":"Personalized recommendation systems are widely used in many Internet services. The sparse embedding lookup in recommendation models dominates the computational cost of inference due to its intensive irregular memory accesses. Applying resistive random access memory (ReRAM) based process-in-memory (PIM) architecture to accelerate recommendation processing can avoid data movements caused by off-chip memory accesses. However, naïve adoption of ReRAM-based DNN accelerators leads to low computation parallelism and severe under-utilization of computing resources, which is caused by the fine-grained inner-product in feature interaction. In this paper, we propose Rerec, an architecture-algorithm co-designed accelerator, which specializes in fine-grained ReRAM-based inner-product engines with access-aware mapping algorithm for recommendation inference. At the architecture level, we reduce the size and increase the amount of crossbars. The crossbars are fully-connected by Analog-to-Digital Converters (ADCs) in one inner-product engine, which can adapt to the fine-grained and irregular computational patterns and improve the processing parallelism. We further explore trade-offs of (i) crossbar size vs. hardware utilization, and (ii) ADC implementation vs. area/energy efficiency to optimize the design. At the algorithm level, we propose a novel access-aware mapping (AAM) algorithm to optimize resource allocations. Our AAM algorithm tackles the problems of (i) the workload imbalance and (ii) the long recommendation inference latency induced by the great variance of access frequency of embedding vectors. Experimental results show that Rerecachieves 7.69x speedup compared with a ReRAM-based baseline design. Compared to CPU and the state-of-the-art recommendation accelerator, Rerecdemonstrates 29.26x and 3.48x performance improvement, respectively.","PeriodicalId":370791,"journal":{"name":"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130459483","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
AMF-Placer: High-Performance Analytical Mixed-size Placer for FPGA
Pub Date: 2021-11-01, DOI: 10.1109/ICCAD51958.2021.9643574
Tingyuan Liang, Gengjie Chen, Jieru Zhao, Sharad Sinha, Wei Zhang
To enable performance optimization of application mapping on modern field-programmable gate arrays (FPGAs), certain critical-path portions of a design may be prearranged into many multi-cell macros during synthesis. These movable macros, with their shape and resource constraints, lead to a challenging mixed-size placement problem for FPGA designs that previous analytical placers cannot address. In this work, we propose AMF-Placer, an open-source analytical mixed-size FPGA placer with an interface to Xilinx Vivado. To speed up convergence and improve placement quality, AMF-Placer is equipped with a series of new techniques for wirelength optimization, cell spreading, packing, and legalization. On a set of the latest large open-source benchmarks from various domains for Xilinx UltraScale FPGAs, experimental results indicate that AMF-Placer can improve HPWL by 20.4%-89.3% and reduce runtime by 8.0%-84.2% compared to the baseline. Furthermore, by exploiting the parallelism of the proposed algorithms, the placement procedure can be accelerated by 2.41x on average with 8 threads.
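For reference, the HPWL figure quoted above is the standard half-perimeter wirelength: for each net, the half-perimeter of the bounding box of its pin positions, summed over all nets. A minimal version, with an invented toy netlist:

```python
def hpwl(nets, pos):
    """nets: {net: [cell, ...]}, pos: {cell: (x, y)}. Sum of bounding-box
    half-perimeters over all nets."""
    total = 0.0
    for cells in nets.values():
        xs = [pos[c][0] for c in cells]
        ys = [pos[c][1] for c in cells]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

pos = {"a": (0, 0), "b": (3, 4), "c": (1, 2)}
print(hpwl({"n1": ["a", "b", "c"], "n2": ["b", "c"]}, pos))  # 7.0 + 4.0 = 11.0
```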
{"title":"AMF-Placer: High-Performance Analytical Mixed-size Placer for FPGA","authors":"Tingyuan Liang, Gengjie Chen, Jieru Zhao, Sharad Sinha, Wei Zhang","doi":"10.1109/ICCAD51958.2021.9643574","DOIUrl":"https://doi.org/10.1109/ICCAD51958.2021.9643574","url":null,"abstract":"To enable the performance optimization of application mapping on modern field-programmable gate arrays (FPGAs), certain critical path portions of the designs might be prearranged into many multi-cell macros during synthesis. These movable macros with constraints of shape and resources lead to challenging mixed-size placement for FPGA designs which cannot be addressed by previous works of analytical placers. In this work, we propose AMF-Placer, an open-source Analytical Mixed-size FPGA placer supporting mixed-size placement on FPGA, with an interface to Xilinx Vivado. To speed up the convergence and improve the quality of the placement, AMF-Placer is equipped with a series of new techniques for wirelength optimization, cell spreading, packing, and legalization. Based on a set of the latest large open-source benchmarks from various domains for Xilinx Ultrascale FPGAs, experimental results indicate that AMF-Placer can improve HPWL by 20.4%-89.3% and reduce runtime by 8.0%-84.2%, compared to the baseline. Furthermore, utilizing the parallelism of the proposed algorithms, with 8 threads, the placement procedure can be accelerated by 2.41x on average.","PeriodicalId":370791,"journal":{"name":"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"81 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116930343","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Heterogeneous Manycore Architectures Enabled by Processing-in-Memory for Deep Learning: From CNNs to GNNs: (ICCAD Special Session Paper)
Pub Date: 2021-11-01, DOI: 10.1109/ICCAD51958.2021.9643559
Biresh Kumar Joardar, Aqeeb Iqbal Arka, J. Doppa, P. Pande, Hai Helen Li, K. Chakrabarty
Resistive random-access memory (ReRAM)-based processing-in-memory (PIM) architectures have recently become a popular architectural choice for deep-learning applications. ReRAM-based architectures can accelerate the inference and training of deep learning algorithms and are more energy-efficient than traditional GPUs. However, these architectures have various limitations that affect model accuracy and performance. Moreover, the choice of deep-learning application also imposes new design challenges that must be addressed to achieve high performance. In this paper, we present the advantages and challenges associated with ReRAM-based PIM architectures by considering Convolutional Neural Networks (CNNs) and Graph Neural Networks (GNNs) as important application domains. We also outline methods that can be used to address these challenges.
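A compact sketch of the core ReRAM-PIM operation both CNN and GNN accelerators build on: a matrix-vector multiply performed in place by a crossbar, with weights stored as conductances and inputs applied as voltages (Ohm's law plus Kirchhoff current summation). The quantization step models the limited number of conductance levels in real devices; the sizes and level count are made up for illustration.

```python
import numpy as np

def crossbar_mvm(weights, x, levels=16):
    """Approximate W.T @ x as a crossbar would: quantize weights to a few
    discrete conductance states, then sum column currents."""
    w_max = np.abs(weights).max()
    g = np.round(weights / w_max * (levels - 1)) / (levels - 1) * w_max
    return g.T @ x  # column current = sum over rows of conductance * voltage

rng = np.random.default_rng(0)
W, x = rng.standard_normal((8, 4)), rng.standard_normal(8)
print(crossbar_mvm(W, x))  # quantized in-crossbar result
print(W.T @ x)             # ideal digital result, for comparison
```

The gap between the two printed vectors is one source of the accuracy limitations the paper discusses.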
{"title":"Heterogeneous Manycore Architectures Enabled by Processing-in-Memory for Deep Learning: From CNNs to GNNs: (ICCAD Special Session Paper)","authors":"Biresh Kumar Joardar, Aqeeb Iqbal Arka, J. Doppa, P. Pande, Hai Helen Li, K. Chakrabarty","doi":"10.1109/ICCAD51958.2021.9643559","DOIUrl":"https://doi.org/10.1109/ICCAD51958.2021.9643559","url":null,"abstract":"Resistive random-access memory (ReRAM)-based processing-in-memory (PIM) architectures have recently become a popular architectural choice for deep-learning applications. ReRAM-based architectures can accelerate inferencing and training of deep learning algorithms and are more energy efficient compared to traditional GPUs. However, these architectures have various limitations that affect the model accuracy and performance. Moreover, the choice of the deep-learning application also imposes new design challenges that must be addressed to achieve high performance. In this paper, we present the advantages and challenges associated with ReRAM-based PIM architectures by considering Convolutional Neural Networks (CNNs) and Graph Neural Networks (GNNs) as important application domains. We also outline methods that can be used to address these challenges.","PeriodicalId":370791,"journal":{"name":"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"194 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132640181","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Framework for Area-efficient Multi-task BERT Execution on ReRAM-based Accelerators
Pub Date: 2021-11-01, DOI: 10.1109/ICCAD51958.2021.9643471
Myeonggu Kang, Hyein Shin, Jaekang Shin, L. Kim
Owing to its superior algorithmic performance, BERT has become the de facto standard model for various NLP tasks. Accordingly, multiple BERT models are often deployed on a single system, which is called multi-task BERT. Although ReRAM-based accelerators show sufficient potential to execute a single BERT model by adopting in-memory computation, processing multi-task BERT on a ReRAM-based accelerator drastically increases the overall area due to the multiple fine-tuned models. In this paper, we propose a framework for area-efficient multi-task BERT execution on ReRAM-based accelerators. First, we decompose the fine-tuned model of each task with respect to the base model. After that, we propose a two-stage weight compressor, which shrinks the decomposed models by analyzing the properties of the ReRAM-based accelerator. We also present a profiler to generate hyper-parameters for the proposed compressor. By sharing the base model and compressing the decomposed models, the proposed framework successfully reduces the total area of the ReRAM-based accelerator without an additional training procedure. It achieves 0.26x the area of the baseline while maintaining the algorithmic performance.
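A small sketch of the decomposition step described above: store the shared base-model weights once, and for each task keep only a sparse delta between the fine-tuned and base weights. The magnitude-pruning threshold here is a stand-in for the paper's two-stage compressor, whose actual criteria (and ReRAM-aware choices) are not reproduced.

```python
import numpy as np

def decompose(base, finetuned, keep=0.10):
    """Return a sparse per-task delta keeping only the largest `keep`
    fraction of |finetuned - base| entries (illustrative heuristic)."""
    delta = finetuned - base
    thresh = np.quantile(np.abs(delta), 1.0 - keep)
    return np.where(np.abs(delta) >= thresh, delta, 0.0)

rng = np.random.default_rng(1)
base = rng.standard_normal((64, 64))
task = base + 0.05 * rng.standard_normal((64, 64))  # fine-tuning drift
sparse_delta = decompose(base, task)
approx_task = base + sparse_delta                   # reconstructed task model
print(np.count_nonzero(sparse_delta) / sparse_delta.size)  # ~0.10
```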
{"title":"A Framework for Area-efficient Multi-task BERT Execution on ReRAM-based Accelerators","authors":"Myeonggu Kang, Hyein Shin, Jaekang Shin, L. Kim","doi":"10.1109/ICCAD51958.2021.9643471","DOIUrl":"https://doi.org/10.1109/ICCAD51958.2021.9643471","url":null,"abstract":"With the superior algorithmic performances, BERT has become the de-facto standard model for various NLP tasks. Accordingly, multiple BERT models have been adopted on a single system, which is also called multi-task BERT. Although the ReRAM-based accelerator shows the sufficient potential to execute a single BERT model by adopting in-memory computation, processing multi-task BERT on the ReRAM-based accelerator extremely increases the overall area due to multiple fine-tuned models. In this paper, we propose a framework for area-efficient multi-task BERT execution on the ReRAM-based accelerator. Firstly, we decompose the fine-tuned model of each task by utilizing the base-model. After that, we propose a two-stage weight compressor, which shrinks the decomposed models by analyzing the properties of the ReRAM-based accelerator. We also present a profiler to generate hyper-parameters for the proposed compressor. By sharing the base-model and compressing the decomposed models, the proposed framework successfully reduces the total area of the ReRAM-based accelerator without an additional training procedure. It achieves a 0.26 x area than baseline while maintaining the algorithmic performances.","PeriodicalId":370791,"journal":{"name":"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"23 9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116407809","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
HeteroCPPR: Accelerating Common Path Pessimism Removal with Heterogeneous CPU-GPU Parallelism
Pub Date: 2021-11-01, DOI: 10.1109/ICCAD51958.2021.9643457
Zizheng Guo, Tsung-Wei Huang, Yibo Lin
Common path pessimism removal (CPPR) is a key step in eliminating unwanted pessimism during static timing analysis (STA). Unwanted pessimism forces designers and optimization algorithms to waste a significant yet unnecessary amount of effort fixing paths that already meet the intended timing constraints. However, CPPR is extremely time-consuming and can incur 10–100× runtime overheads to complete. Existing solutions for speeding up CPPR are architecturally constrained by CPU-only parallelism, and their runtimes do not scale beyond 8–16 cores. In this paper, we introduce HeteroCPPR, a new algorithm that accelerates CPPR by harnessing the power of heterogeneous CPU-GPU parallelism. We devise an efficient CPU-GPU task decomposition strategy and highly optimized GPU kernels to handle CPPR for large numbers of paths. HeteroCPPR can also scale to multiple GPUs. As an example, HeteroCPPR is up to 16× faster than a state-of-the-art CPU-parallel CPPR algorithm for completing the analysis of 10K post-CPPR critical paths in a million-gate design on a machine with 40 CPUs and 4 GPUs.
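For readers unfamiliar with the pessimism being removed: when the launch and capture clock paths of a path pair share a prefix in the clock tree, that prefix cannot simultaneously be at its early and late corner, so the shared (late - early) delay is credited back to the slack. A minimal per-path-pair version, with invented node names and delays (the paper's contribution is computing this at scale on CPUs and GPUs, not the credit formula itself):

```python
def cppr_credit(launch_path, capture_path, early, late):
    """Paths are node lists from the clock root; early/late map each
    clock-tree node to its min/max delay. Credit = late-early slack of
    the common prefix."""
    credit = 0.0
    for a, b in zip(launch_path, capture_path):
        if a != b:                    # first divergence: stop crediting
            break
        credit += late[a] - early[a]  # shared segment sees one corner only
    return credit

early = {"root": 0.0, "buf1": 0.9, "buf2": 0.8}
late  = {"root": 0.0, "buf1": 1.1, "buf2": 1.0}
launch  = ["root", "buf1", "buf2", "ff_launch"]
capture = ["root", "buf1", "ff_capture"]
print(cppr_credit(launch, capture, early, late))  # 0.0 + 0.2 = 0.2
```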
{"title":"HeteroCPPR: Accelerating Common Path Pessimism Removal with Heterogeneous CPU-GPU Parallelism","authors":"Zizheng Guo, Tsung-Wei Huang, Yibo Lin","doi":"10.1109/ICCAD51958.2021.9643457","DOIUrl":"https://doi.org/10.1109/ICCAD51958.2021.9643457","url":null,"abstract":"Common path pessimism removal (CPPR) is a key step to eliminating unwanted pessimism during static timing analysis (STA). Unwanted pessimism will force designers and optimization algorithms to waste a significant yet unnecessary amount of effort on fixing paths that meet the intended timing constraints. However, CPPR is extremely time-consuming and can incur 10–100× runtime overheads to complete. Existing solutions for speeding up CPPR are architecturally constrained by CPU-only parallelism, and their runtimes do not scale beyond 8–16 cores. In this paper, we introduce HeteroCPPR, a new algorithm to accelerate CPPR by harnessing the power of heterogeneous CPU-GPU parallelism. We devise an efficient CPU-GPU task decomposition strategy and highly optimized GPU kernels to handle CPPR that scales to large numbers of paths. Also, HeteroCPPR can scale to multiple GPUs. As an example, HeteroCPPR is up to 16×faster than a state-of-the-art CPU-parallel CPPR algorithm for completing the analysis of 10K post-CPPR critical paths in a million-gate design under a machine of 40 CPUs and 4 GPUs.","PeriodicalId":370791,"journal":{"name":"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"81 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117307705","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}