Adaptive Layout Decomposition with Graph Embedding Neural Networks
Pub Date: 2020-07-01 | DOI: 10.1109/DAC18072.2020.9218706
Jialu Xia, Yuzhe Ma, Jialu Li, Yibo Lin, Bei Yu
Multiple patterning lithography decomposition (MPLD) has been widely investigated, but so far no decomposer dominates the others in terms of both optimality and efficiency. This observation motivates us to explore how to adaptively select the most suitable MPLD strategy for a given layout graph, which is non-trivial and still an open problem. In this paper, we propose a layout decomposition framework based on graph convolutional networks to obtain graph embeddings of the layout. The graph embeddings are used for graph library construction, decomposer selection, and graph matching. Experimental results show that our graph-embedding-based framework achieves optimal decompositions with negligible runtime overhead, even compared with fast but non-optimal heuristics.
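As a rough illustration of the selection step described in this abstract, the sketch below embeds a layout graph with a single graph-convolution layer and picks the decomposer attached to the most similar library entry. It is a minimal sketch under stated assumptions, not the authors' implementation; the adjacency matrix, node features, random weights, and the decomposer labels ("ILP", "EC", "backtracking") are all illustrative.

```python
# Minimal sketch of embedding-based decomposer selection (not the authors' code).
# Assumes: a layout graph given as an adjacency matrix with per-node features,
# a pre-built library of (embedding, best_decomposer) pairs, and one GCN layer
# with randomly initialized weights purely for illustration.
import numpy as np

def gcn_embed(adj, feats, weight):
    """One GCN layer + mean pooling -> fixed-length graph embedding."""
    a_hat = adj + np.eye(adj.shape[0])               # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(a_hat.sum(axis=1)))
    h = d_inv_sqrt @ a_hat @ d_inv_sqrt @ feats @ weight
    h = np.maximum(h, 0.0)                           # ReLU
    return h.mean(axis=0)                            # graph-level embedding

def select_decomposer(embedding, library):
    """Pick the decomposer attached to the most similar library graph."""
    def cosine(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)
    best = max(library, key=lambda entry: cosine(embedding, entry["embedding"]))
    return best["decomposer"]

# toy usage: a 4-node layout graph, 3-dim node features, 8-dim embedding
rng = np.random.default_rng(0)
adj = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], dtype=float)
feats = rng.random((4, 3))
W = rng.random((3, 8))
emb = gcn_embed(adj, feats, W)
library = [{"embedding": rng.random(8), "decomposer": "ILP"},
           {"embedding": rng.random(8), "decomposer": "EC"},
           {"embedding": rng.random(8), "decomposer": "backtracking"}]
print(select_decomposer(emb, library))
```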
Exploring a Bayesian Optimization Framework Compatible with Digital Standard Flow for Soft-Error-Tolerant Circuit
Pub Date: 2020-07-01 | DOI: 10.1109/DAC18072.2020.9218696
Yan Li, Xiao-yang Zeng, Zhengqi Gao, Liyu Lin, Jun Tao, Jun Han, Xu Cheng, M. Tahoori, Xiaoyang Zeng
Soft errors are a major reliability concern in advanced technology nodes. Although mitigating the Soft Error Rate (SER) inevitably sacrifices area and power, few studies have paid attention to optimization methods that explore the trade-offs among area, power, and SER. This paper proposes a Bayesian optimization framework for soft-error-tolerant circuit design. It comprises two steps: 1) data preprocessing and 2) Bayesian optimization. In the preprocessing step, a strategy incorporating the k-means algorithm and a novel sequencing algorithm clusters flip-flops (FFs) with similar SER in order to reduce the dimensionality for the subsequent step. A Bayesian neural network (BNN) serves as the surrogate model for acquiring the posterior distribution of the three design metrics, while lower confidence bound (LCB) functions are employed as acquisition functions to select the next design point based on the BNN during optimization. Finally, the non-dominated sorting genetic algorithm (NSGA-II) is used to search for Pareto-optimal front (POF) solutions of the three LCB functions. Experimental results demonstrate that the proposed framework achieves a 1.4x improvement in accuracy and a 70% reduction in SER with acceptable increases in power and area.
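As a hedged sketch of two ingredients mentioned above, the snippet below clusters flip-flops by SER with k-means and computes a lower-confidence-bound acquisition value from posterior samples of a surrogate model. The function names, the kappa value, and the use of generic posterior samples (rather than the paper's BNN) are assumptions for illustration.

```python
# Rough sketch of two pieces of the flow described above (illustrative only):
# (1) clustering flip-flops by SER to shrink the design space, and
# (2) a lower-confidence-bound (LCB) acquisition value computed from posterior
#     samples of a surrogate model. Names and the kappa value are assumptions.
import numpy as np
from sklearn.cluster import KMeans

def cluster_flip_flops(ser_values, n_clusters=8):
    """Group FFs with similar SER so one hardening decision covers a cluster."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(
        np.asarray(ser_values).reshape(-1, 1))
    return labels

def lcb(posterior_samples, kappa=2.0):
    """LCB for a minimized metric: mean - kappa * std over surrogate samples."""
    mu = posterior_samples.mean(axis=0)
    sigma = posterior_samples.std(axis=0)
    return mu - kappa * sigma

# toy usage: 100 FFs with random SER, and 64 posterior samples for 5 candidates
rng = np.random.default_rng(1)
labels = cluster_flip_flops(rng.random(100))
samples = rng.normal(loc=1.0, scale=0.1, size=(64, 5))   # e.g. predicted power
print(labels[:10], lcb(samples))
```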
Creating an Agile Hardware Design Flow
Pub Date: 2020-07-01 | DOI: 10.1109/DAC18072.2020.9218553
Rick Bahr, Clark W. Barrett, Nikhil Bhagdikar, Alex Carsello, Ross G. Daly, Caleb Donovick, David Durst, K. Fatahalian, Kathleen Feng, P. Hanrahan, Teguh Hofstee, M. Horowitz, Dillon Huff, Fredrik Kjolstad, Taeyoung Kong, Qiaoyi Liu, Makai Mann, J. Melchert, Ankita Nayak, Aina Niemetz, Gedeon Nyengele, Priyanka Raina, S. Richardson, Rajsekhar Setaluri, Jeff Setter, Kavya Sreedhar, Maxwell Strange, James J. Thomas, Christopher Torng, Lenny Truong, Nestan Tsiskaridze, Keyi Zhang
Although an agile approach is standard for software design, how to properly adapt this method to hardware is still an open question. This work addresses this question while building a system on chip (SoC) with specialized accelerators. Rather than using a traditional waterfall design flow, which starts by studying the application to be accelerated, we begin by constructing a complete flow from an application expressed in a high-level domain-specific language (DSL), in our case Halide, to a generic coarse-grained reconfigurable array (CGRA). As our understanding of the application grows, the CGRA design evolves, and we have developed a suite of tools that tune application code, the compiler, and the CGRA to increase the efficiency of the resulting implementation. To meet our continued need to update parts of the system while maintaining the end-to-end flow, we have created DSL-based hardware generators that not only provide the Verilog needed for the implementation of the CGRA, but also create the collateral that the compiler/mapper/place-and-route system needs to configure its operation. This work provides a systematic approach for designing and evolving high-performance and energy-efficient hardware-software systems for any application domain.
A Two-way SRAM Array based Accelerator for Deep Neural Network On-chip Training
Pub Date: 2020-07-01 | DOI: 10.1109/DAC18072.2020.9218524
Hongwu Jiang, Shanshi Huang, Xiaochen Peng, Jian-Wei Su, Yen-Chi Chou, Wei-Hsing Huang, Ta-Wei Liu, Ruhui Liu, Meng-Fan Chang, Shimeng Yu
On-chip training of large-scale deep neural networks (DNNs) is challenging due to computational complexity and resource limitations. Compute-in-memory (CIM) architectures exploit analog computation inside the memory array to speed up vector-matrix multiplication (VMM) and alleviate the memory bottleneck. However, existing CIM prototype chips, in particular SRAM-based accelerators, target only low-precision inference engines. In this work, we propose a two-way SRAM array design that can perform bi-directional in-memory VMM with minimum hardware overhead. A novel solution for signed-number multiplication is also proposed to handle the negative inputs in backpropagation. We taped out and validated the proposed two-way SRAM array design in a TSMC 28nm process. Based on silicon measurement data of the CIM macro, we explore the hardware performance of the entire architecture for DNN on-chip training. The experimental data show that the proposed accelerator can achieve an energy efficiency of ~3.2 TOPS/W, with >1000 FPS and >300 FPS for ResNet and DenseNet training on ImageNet, respectively.
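The paper's signed-number scheme is not detailed here; as one common functional workaround, the behavioral sketch below splits a signed input vector into its positive and negative parts and builds a signed vector-matrix multiply from two unsigned passes. It illustrates the requirement imposed by backpropagation, not the silicon implementation.

```python
# Behavioral sketch (not the silicon scheme): one common way to handle signed
# activations/errors with an unsigned in-memory VMM is to split each input
# vector into positive and negative parts, run two unsigned passes, and subtract.
import numpy as np

def signed_vmm(x, w_unsigned):
    """Signed-input vector-matrix multiply built from unsigned VMM passes."""
    x_pos = np.maximum(x, 0)          # positive part of the input
    x_neg = np.maximum(-x, 0)         # magnitude of the negative part
    return x_pos @ w_unsigned - x_neg @ w_unsigned

rng = np.random.default_rng(2)
x = rng.normal(size=8)                # e.g. a backpropagated error vector
w = rng.random((8, 4))                # unsigned conductance-coded weights
assert np.allclose(signed_vmm(x, w), x @ w)
```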
The Power of Simulation for Equivalence Checking in Quantum Computing
Pub Date: 2020-07-01 | DOI: 10.1109/DAC18072.2020.9218563
Lukas Burgholzer, R. Wille
The rapid rate of progress in the physical realization of quantum computers has sparked the development of elaborate design flows for quantum computations on such devices. Each stage of these flows comes with its own representation of the intended functionality. Ensuring that each design step preserves this intended functionality is of utmost importance. However, existing solutions for equivalence checking of quantum computations struggle with the complexity of the underlying problem, so that in many cases no conclusion about equivalence can be reached with reasonable effort. In this work, we uncover the power of simulation for equivalence checking in quantum computing. We show that, in contrast to classical computing, it is in general not necessary to compare the complete representations of the respective computations. Even small errors frequently affect the entire representation and, thus, can be detected within a couple of simulations. The resulting equivalence checking flow substantially improves upon the state of the art by drastically accelerating the detection of errors or providing a highly probable estimate of the operations’ equivalence.
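The idea can be illustrated with a toy state-vector check: feed a few random stimuli through both circuits and compare the outputs, stopping as soon as a mismatch is found. The dense unitary matrices, the number of trials, and the tolerance below are illustrative assumptions; the actual engine operates on compact circuit representations.

```python
# Toy illustration of simulation-based equivalence checking: instead of comparing
# full circuit representations, run a few random stimuli through both circuits
# and compare the resulting states. Dense unitaries are used only for clarity.
import numpy as np

def probably_equivalent(u1, u2, trials=5, tol=1e-9, seed=0):
    rng = np.random.default_rng(seed)
    dim = u1.shape[0]
    for _ in range(trials):
        psi = rng.normal(size=dim) + 1j * rng.normal(size=dim)
        psi /= np.linalg.norm(psi)                     # random stimulus state
        fidelity = abs(np.vdot(u1 @ psi, u2 @ psi))    # |<u1 psi | u2 psi>|
        if fidelity < 1.0 - tol:
            return False                               # mismatch found -> not equivalent
    return True                                        # no counterexample observed

# usage: a CNOT versus the same CNOT followed by a stray X gate on the target
cnot = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]], dtype=complex)
x_on_target = np.kron(np.eye(2), np.array([[0, 1], [1, 0]], dtype=complex))
print(probably_equivalent(cnot, cnot))                 # True
print(probably_equivalent(cnot, x_on_target @ cnot))   # False
```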
Predictable Memory-CPU Co-Scheduling with Support for Latency-Sensitive Tasks
Pub Date: 2020-07-01 | DOI: 10.1109/DAC18072.2020.9218640
Daniel Casini, P. Pazzaglia, Alessandro Biondi, M. Natale, G. Buttazzo
Predictable execution models have been proposed over the years to achieve contention-free execution of real-time tasks by preloading data into dedicated local memories. In this way, memory access delays can be hidden by delegating a DMA engine to perform memory transfers in parallel with processor execution. Nevertheless, state-of-the-art protocols introduce additional blocking due to priority inversion, which may severely penalize latency-sensitive applications and even worsen the system schedulability with respect to the use of classical scheduling schemes. This paper proposes a new protocol that allows hiding memory transfer delays while reducing priority inversion, thus favoring the schedulability of latency-sensitive tasks. The corresponding analysis is formulated as an optimization problem. Experimental results show the advantages of the proposed protocol against state-of-the-art solutions.
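A toy two-stage timeline model illustrates why overlapping DMA transfers with execution hides memory delays: while the CPU executes task i from local memory, the DMA already loads task i+1. The model below assumes back-to-back DMA transfers and unlimited local buffering; it is not the protocol or the schedulability analysis proposed in the paper.

```python
# Toy timeline model of memory/compute overlap under a predictable execution
# model (illustrative assumptions only: single DMA, unlimited local buffering).
def sequential_makespan(tasks):
    """No overlap: every task first loads its data, then executes."""
    return sum(mem + cpu for mem, cpu in tasks)

def pipelined_makespan(tasks):
    """DMA loads task i+1 while the CPU executes task i."""
    dma_done = 0.0   # time at which the DMA finished its latest transfer
    cpu_done = 0.0   # time at which the CPU finished its latest task
    for mem, cpu in tasks:
        dma_done += mem                            # transfers run back to back
        cpu_done = max(cpu_done, dma_done) + cpu   # compute waits for its data
    return cpu_done

tasks = [(2.0, 5.0), (3.0, 4.0), (1.0, 6.0)]       # (memory time, compute time)
print(sequential_makespan(tasks))                  # 21.0
print(pipelined_makespan(tasks))                   # 17.0 -> memory delay largely hidden
```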
Building End-to-End IoT Applications with QoS Guarantees
Pub Date: 2020-07-01 | DOI: 10.1109/DAC18072.2020.9218564
A. Hamann, Selma Saidi, David Ginthoer, C. Wietfeld, D. Ziegenbein
Many industrial players are currently challenged in building distributed CPS and IoT applications with stringent end-to-end QoS requirements. Examples are Vehicle-to-X applications, Advanced Driver-Assistance Systems (ADAS), and functionalities in the Industrial Internet of Things (IIoT). Currently, there is no comprehensive solution that allows such distributed applications to be efficiently programmed, deployed, and operated. This paper focuses on real-time concerns in building distributed CPS and IoT systems. The focus lies, on the one hand, on mechanisms required inside the IoT (compute) nodes and, on the other hand, on the communication protocols, such as TSN and 5G, that connect them. In the authors’ view, the required building blocks for a first end-to-end technology stack are available. However, their integration into a holistic framework is missing.
Dynamic Information Flow Tracking for Embedded Binaries using SystemC-based Virtual Prototypes
Pub Date: 2020-07-01 | DOI: 10.1109/DAC18072.2020.9218494
Pascal Pieper, V. Herdt, Daniel Große, R. Drechsler
Avoiding security vulnerabilities is very important for embedded systems. Dynamic Information Flow Tracking (DIFT) is a powerful technique to analyze software with respect to security policies in order to protect the system against a broad range of security-related exploits. However, existing DIFT approaches are either unavailable for Virtual Prototypes (VPs) or fail to model complex hardware/software interactions. In this paper, we present a novel approach that enables early and accurate DIFT of binaries targeting embedded systems with custom peripherals. Leveraging the SystemC framework, our DIFT engine tracks accurate data flow information alongside the program execution to detect violations of security policies at run-time. We demonstrate the effectiveness and applicability of our approach through extensive experiments.
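As a conceptual illustration of the DIFT mechanism (not the SystemC-based engine itself), the toy register machine below attaches a taint bit to data arriving from a peripheral, propagates it through ALU operations, and flags a policy violation when tainted data reaches a protected sink. The register names and the policy are illustrative assumptions.

```python
# Conceptual DIFT sketch on a toy register machine: a taint bit follows data
# from an untrusted source; using tainted data at a protected sink violates the
# policy. Illustration of the general mechanism only.
class ToyDIFT:
    def __init__(self):
        self.regs = {}
        self.taint = {}

    def load_input(self, reg, value):          # data arriving from a peripheral
        self.regs[reg] = value
        self.taint[reg] = True                 # mark untrusted data as tainted

    def load_const(self, reg, value):
        self.regs[reg] = value
        self.taint[reg] = False

    def alu(self, dst, src1, src2):
        self.regs[dst] = self.regs[src1] + self.regs[src2]
        self.taint[dst] = self.taint[src1] or self.taint[src2]   # taint union

    def store_protected(self, reg):
        if self.taint[reg]:
            raise RuntimeError("policy violation: tainted data reaches protected sink")

dift = ToyDIFT()
dift.load_input("r0", 0x41)     # attacker-controlled byte
dift.load_const("r1", 1)
dift.alu("r2", "r0", "r1")      # taint propagates through the ALU
try:
    dift.store_protected("r2")  # tainted data reaches the sink
except RuntimeError as err:
    print(err)                  # -> policy violation reported at run-time
```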
A History-Based Auto-Tuning Framework for Fast and High-Performance DNN Design on GPU
Pub Date: 2020-07-01 | DOI: 10.1109/DAC18072.2020.9218700
Jiandong Mu, Mengdi Wang, Lanbo Li, Jun Yang, Wei Lin, Wei Zhang
While Deep Neural Networks (DNNs) are becoming increasingly popular, there is a growing trend to accelerate DNN applications on hardware platforms such as GPUs and FPGAs to gain higher performance and efficiency. However, tuning the performance for such platforms is time-consuming due to the large design space and the high cost of evaluating each design point. Although many tuning algorithms, such as the XGBoost tuner and the genetic algorithm (GA) tuner, have been proposed in previous work to guide the design-space exploration process, tuning time remains a critical problem. In this work, we propose a novel auto-tuning framework to optimize DNN operator design on GPU by efficiently leveraging the tuning history in different scenarios. Our experiments show that we achieve superior performance over state-of-the-art work, such as the auto-tuning framework TVM and the hand-optimized library cuDNN, while reducing the search time by 8.96x and 4.58x compared with the XGBoost tuner and the GA tuner in TVM.
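The sketch below illustrates the history-reuse idea in isolation, under assumed, hypothetical helper names: replay the best configurations recorded for previously tuned operators, then refine locally around the best replayed one. It is not the framework's actual API or TVM code.

```python
# Illustrative sketch of history reuse in auto-tuning (not the framework's API):
# replay the best recorded configurations for similar operators, then refine
# locally around the best replayed configuration.
import random

def tune_with_history(evaluate, history, neighbors, budget=20, top_k=5):
    """evaluate(cfg) -> latency; history: list of (cfg, old_latency) records."""
    seeds = [cfg for cfg, _ in sorted(history, key=lambda r: r[1])[:top_k]]
    best_cfg, best_lat = None, float("inf")
    for cfg in seeds:                          # replay promising historical configs
        lat = evaluate(cfg)
        if lat < best_lat:
            best_cfg, best_lat = cfg, lat
    for _ in range(budget - len(seeds)):       # local refinement around the best seed
        cand = random.choice(neighbors(best_cfg))
        lat = evaluate(cand)
        if lat < best_lat:
            best_cfg, best_lat = cand, lat
    return best_cfg, best_lat

# toy usage: configurations are tile sizes; "latency" is a made-up convex cost
evaluate = lambda t: (t - 48) ** 2 + 10
history = [(32, 300.0), (64, 280.0), (16, 900.0)]
neighbors = lambda t: [max(1, t - 8), t + 8]
random.seed(0)
print(tune_with_history(evaluate, history, neighbors))
```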
Dadu-CD: Fast and Efficient Processing-in-Memory Accelerator for Collision Detection
Pub Date: 2020-07-01 | DOI: 10.1109/DAC18072.2020.9218709
Yuxin Yang, Xiaoming Chen, Yinhe Han
Collision detection is a fundamental task in motion planning for robotics. Typically, collision detection is the performance bottleneck of an entire motion-planning pipeline, as well as the dominant source of its energy consumption. Several hardware accelerators have been proposed for collision detection, which achieve higher performance and energy efficiency than general-purpose CPUs and GPUs. However, existing accelerators still face a memory-bandwidth bottleneck, due to the large data volume required by the parallel processing cores and the limited DRAM bandwidth. In this work, we propose a novel collision detection accelerator that employs the processing-in-memory technique. We elaborate the in-memory processing architecture to fully utilize the internal bandwidth of DRAM banks. To make the algorithm and hardware highly efficient for in-memory processing, a set of innovative software and hardware techniques is also proposed. Compared with a state-of-the-art ASIC-based collision detection accelerator, both the performance and the energy efficiency of our accelerator are significantly improved.
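One conceptual way collision checks map onto in-memory bitwise operations (an assumption for illustration, not necessarily this accelerator's dataflow) is to voxelize robot poses and obstacles into occupancy bitmaps, so that a check reduces to a wide AND followed by an OR-reduction, which bank-level processing-in-memory can execute in parallel.

```python
# Conceptual mapping only (not the accelerator's dataflow): with robot poses and
# obstacles voxelized into occupancy bitmaps, a collision check becomes a wide
# bitwise AND plus an OR-reduction -- the kind of operation that DRAM-bank-level
# processing-in-memory can run in parallel.
import numpy as np

def in_collision(pose_bitmap, obstacle_bitmap):
    """True if any voxel is occupied by both the robot pose and an obstacle."""
    return bool(np.any(pose_bitmap & obstacle_bitmap))

# toy usage: 16x16x16 voxel grid packed as a boolean array
grid = (16, 16, 16)
obstacles = np.zeros(grid, dtype=bool)
obstacles[8:12, 8:12, :] = True            # a box-shaped obstacle
pose_free = np.zeros(grid, dtype=bool)
pose_free[0:4, 0:4, :] = True              # robot far from the obstacle
pose_hit = np.zeros(grid, dtype=bool)
pose_hit[10:14, 10:14, :] = True           # robot overlapping the obstacle
print(in_collision(pose_free, obstacles))  # False
print(in_collision(pose_hit, obstacles))   # True
```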