Adaptive Layout Decomposition with Graph Embedding Neural Networks
Pub Date: 2020-07-01 | DOI: 10.1109/DAC18072.2020.9218706
Jialu Xia, Yuzhe Ma, Jialu Li, Yibo Lin, Bei Yu
Multiple patterning lithography decomposition (MPLD) has been widely investigated, but so far no decomposer dominates the others in terms of both optimality and efficiency. This observation motivates us to explore how to adaptively select the most suitable MPLD strategy for a given layout graph, which is non-trivial and still an open problem. In this paper, we propose a layout decomposition framework based on graph convolutional networks to obtain graph embeddings of the layout. The graph embeddings are used for graph library construction, decomposer selection, and graph matching. Experimental results show that our graph-embedding-based framework achieves optimal decompositions with negligible runtime overhead, even compared with fast but non-optimal heuristics.
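As a rough illustration of the selection step described in this abstract, the sketch below embeds a layout graph with a single graph-convolution layer and picks the decomposer attached to the most similar library entry. It is a minimal sketch under stated assumptions, not the authors' implementation; the adjacency matrix, node features, random weights, and the decomposer labels ("ILP", "EC", "backtracking") are all illustrative.

```python
# Minimal sketch of embedding-based decomposer selection (not the authors' code).
# Assumes: a layout graph given as an adjacency matrix with per-node features,
# a pre-built library of (embedding, best_decomposer) pairs, and one GCN layer
# with randomly initialized weights purely for illustration.
import numpy as np

def gcn_embed(adj, feats, weight):
    """One GCN layer + mean pooling -> fixed-length graph embedding."""
    a_hat = adj + np.eye(adj.shape[0])               # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(a_hat.sum(axis=1)))
    h = d_inv_sqrt @ a_hat @ d_inv_sqrt @ feats @ weight
    h = np.maximum(h, 0.0)                           # ReLU
    return h.mean(axis=0)                            # graph-level embedding

def select_decomposer(embedding, library):
    """Pick the decomposer attached to the most similar library graph."""
    def cosine(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)
    best = max(library, key=lambda entry: cosine(embedding, entry["embedding"]))
    return best["decomposer"]

# toy usage: a 4-node layout graph, 3-dim node features, 8-dim embedding
rng = np.random.default_rng(0)
adj = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], dtype=float)
feats = rng.random((4, 3))
W = rng.random((3, 8))
emb = gcn_embed(adj, feats, W)
library = [{"embedding": rng.random(8), "decomposer": "ILP"},
           {"embedding": rng.random(8), "decomposer": "EC"},
           {"embedding": rng.random(8), "decomposer": "backtracking"}]
print(select_decomposer(emb, library))
```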
Exploring a Bayesian Optimization Framework Compatible with Digital Standard Flow for Soft-Error-Tolerant Circuit
Pub Date: 2020-07-01 | DOI: 10.1109/DAC18072.2020.9218696
Yan Li, Xiao-yang Zeng, Zhengqi Gao, Liyu Lin, Jun Tao, Jun Han, Xu Cheng, M. Tahoori, Xiaoyang Zeng
Soft errors are a major reliability concern in advanced technology nodes. Although mitigating the Soft Error Rate (SER) inevitably sacrifices area and power, few studies have paid attention to optimization methods that explore the trade-offs among area, power, and SER. This paper proposes a Bayesian optimization framework for soft-error-tolerant circuit design. It comprises two steps: 1) data preprocessing and 2) Bayesian optimization. In the preprocessing step, a strategy incorporating the k-means algorithm and a novel sequencing algorithm clusters flip-flops (FFs) with similar SER in order to reduce the dimensionality for the subsequent step. A Bayesian neural network (BNN) serves as the surrogate model for acquiring the posterior distribution of the three design metrics, while lower confidence bound (LCB) functions are employed as acquisition functions to select the next design point based on the BNN during optimization. Finally, the non-dominated sorting genetic algorithm (NSGA-II) is used to search for Pareto-optimal front (POF) solutions of the three LCB functions. Experimental results demonstrate that the proposed framework achieves a 1.4x improvement in accuracy and a 70% reduction in SER with acceptable increases in power and area.
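As a hedged sketch of two ingredients mentioned above, the snippet below clusters flip-flops by SER with k-means and computes a lower-confidence-bound acquisition value from posterior samples of a surrogate model. The function names, the kappa value, and the use of generic posterior samples (rather than the paper's BNN) are assumptions for illustration.

```python
# Rough sketch of two pieces of the flow described above (illustrative only):
# (1) clustering flip-flops by SER to shrink the design space, and
# (2) a lower-confidence-bound (LCB) acquisition value computed from posterior
#     samples of a surrogate model. Names and the kappa value are assumptions.
import numpy as np
from sklearn.cluster import KMeans

def cluster_flip_flops(ser_values, n_clusters=8):
    """Group FFs with similar SER so one hardening decision covers a cluster."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(
        np.asarray(ser_values).reshape(-1, 1))
    return labels

def lcb(posterior_samples, kappa=2.0):
    """LCB for a minimized metric: mean - kappa * std over surrogate samples."""
    mu = posterior_samples.mean(axis=0)
    sigma = posterior_samples.std(axis=0)
    return mu - kappa * sigma

# toy usage: 100 FFs with random SER, and 64 posterior samples for 5 candidates
rng = np.random.default_rng(1)
labels = cluster_flip_flops(rng.random(100))
samples = rng.normal(loc=1.0, scale=0.1, size=(64, 5))   # e.g. predicted power
print(labels[:10], lcb(samples))
```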
Creating an Agile Hardware Design Flow
Pub Date: 2020-07-01 | DOI: 10.1109/DAC18072.2020.9218553
Rick Bahr, Clark W. Barrett, Nikhil Bhagdikar, Alex Carsello, Ross G. Daly, Caleb Donovick, David Durst, K. Fatahalian, Kathleen Feng, P. Hanrahan, Teguh Hofstee, M. Horowitz, Dillon Huff, Fredrik Kjolstad, Taeyoung Kong, Qiaoyi Liu, Makai Mann, J. Melchert, Ankita Nayak, Aina Niemetz, Gedeon Nyengele, Priyanka Raina, S. Richardson, Rajsekhar Setaluri, Jeff Setter, Kavya Sreedhar, Maxwell Strange, James J. Thomas, Christopher Torng, Lenny Truong, Nestan Tsiskaridze, Keyi Zhang
Although an agile approach is standard for software design, how to properly adapt this method to hardware is still an open question. This work addresses this question while building a system on chip (SoC) with specialized accelerators. Rather than using a traditional waterfall design flow, which starts by studying the application to be accelerated, we begin by constructing a complete flow from an application expressed in a high-level domain-specific language (DSL), in our case Halide, to a generic coarse-grained reconfigurable array (CGRA). As our understanding of the application grows, the CGRA design evolves, and we have developed a suite of tools that tune application code, the compiler, and the CGRA to increase the efficiency of the resulting implementation. To meet our continued need to update parts of the system while maintaining the end-to-end flow, we have created DSL-based hardware generators that not only provide the Verilog needed for the implementation of the CGRA, but also create the collateral that the compiler/mapper/place-and-route system needs to configure its operation. This work provides a systematic approach for designing and evolving high-performance and energy-efficient hardware-software systems for any application domain.
A Two-way SRAM Array based Accelerator for Deep Neural Network On-chip Training
Pub Date: 2020-07-01 | DOI: 10.1109/DAC18072.2020.9218524
Hongwu Jiang, Shanshi Huang, Xiaochen Peng, Jian-Wei Su, Yen-Chi Chou, Wei-Hsing Huang, Ta-Wei Liu, Ruhui Liu, Meng-Fan Chang, Shimeng Yu
On-chip training of large-scale deep neural networks (DNNs) is challenging due to computational complexity and resource limitations. Compute-in-memory (CIM) architectures exploit analog computation inside the memory array to speed up vector-matrix multiplication (VMM) and alleviate the memory bottleneck. However, existing CIM prototype chips, in particular SRAM-based accelerators, target only low-precision inference engines. In this work, we propose a two-way SRAM array design that can perform bi-directional in-memory VMM with minimum hardware overhead. A novel solution for signed-number multiplication is also proposed to handle the negative inputs in backpropagation. We taped out and validated the proposed two-way SRAM array design in a TSMC 28nm process. Based on silicon measurement data of the CIM macro, we explore the hardware performance of the entire architecture for DNN on-chip training. The experimental data show that the proposed accelerator can achieve an energy efficiency of ~3.2 TOPS/W, with >1000 FPS and >300 FPS for ResNet and DenseNet training on ImageNet, respectively.
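The paper's signed-number scheme is not detailed here; as one common functional workaround, the behavioral sketch below splits a signed input vector into its positive and negative parts and builds a signed vector-matrix multiply from two unsigned passes. It illustrates the requirement imposed by backpropagation, not the silicon implementation.

```python
# Behavioral sketch (not the silicon scheme): one common way to handle signed
# activations/errors with an unsigned in-memory VMM is to split each input
# vector into positive and negative parts, run two unsigned passes, and subtract.
import numpy as np

def signed_vmm(x, w_unsigned):
    """Signed-input vector-matrix multiply built from unsigned VMM passes."""
    x_pos = np.maximum(x, 0)          # positive part of the input
    x_neg = np.maximum(-x, 0)         # magnitude of the negative part
    return x_pos @ w_unsigned - x_neg @ w_unsigned

rng = np.random.default_rng(2)
x = rng.normal(size=8)                # e.g. a backpropagated error vector
w = rng.random((8, 4))                # unsigned conductance-coded weights
assert np.allclose(signed_vmm(x, w), x @ w)
```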
The Power of Simulation for Equivalence Checking in Quantum Computing
Pub Date: 2020-07-01 | DOI: 10.1109/DAC18072.2020.9218563
Lukas Burgholzer, R. Wille
The rapid rate of progress in the physical realization of quantum computers has sparked the development of elaborate design flows for quantum computations on such devices. Each stage of these flows comes with its own representation of the intended functionality. Ensuring that each design step preserves this intended functionality is of utmost importance. However, existing solutions for equivalence checking of quantum computations struggle with the complexity of the underlying problem, so that in many cases no conclusion about equivalence can be reached with reasonable effort. In this work, we uncover the power of simulation for equivalence checking in quantum computing. We show that, in contrast to classical computing, it is in general not necessary to compare the complete representations of the respective computations. Even small errors frequently affect the entire representation and, thus, can be detected within a couple of simulations. The resulting equivalence checking flow substantially improves upon the state of the art by drastically accelerating the detection of errors or providing a highly probable estimate of the operations’ equivalence.
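The idea can be illustrated with a toy state-vector check: feed a few random stimuli through both circuits and compare the outputs, stopping as soon as a mismatch is found. The dense unitary matrices, the number of trials, and the tolerance below are illustrative assumptions; the actual engine operates on compact circuit representations.

```python
# Toy illustration of simulation-based equivalence checking: instead of comparing
# full circuit representations, run a few random stimuli through both circuits
# and compare the resulting states. Dense unitaries are used only for clarity.
import numpy as np

def probably_equivalent(u1, u2, trials=5, tol=1e-9, seed=0):
    rng = np.random.default_rng(seed)
    dim = u1.shape[0]
    for _ in range(trials):
        psi = rng.normal(size=dim) + 1j * rng.normal(size=dim)
        psi /= np.linalg.norm(psi)                     # random stimulus state
        fidelity = abs(np.vdot(u1 @ psi, u2 @ psi))    # |<u1 psi | u2 psi>|
        if fidelity < 1.0 - tol:
            return False                               # mismatch found -> not equivalent
    return True                                        # no counterexample observed

# usage: a CNOT versus the same CNOT followed by a stray X gate on the target
cnot = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]], dtype=complex)
x_on_target = np.kron(np.eye(2), np.array([[0, 1], [1, 0]], dtype=complex))
print(probably_equivalent(cnot, cnot))                 # True
print(probably_equivalent(cnot, x_on_target @ cnot))   # False
```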
Predictable Memory-CPU Co-Scheduling with Support for Latency-Sensitive Tasks
Pub Date: 2020-07-01 | DOI: 10.1109/DAC18072.2020.9218640
Daniel Casini, P. Pazzaglia, Alessandro Biondi, M. Natale, G. Buttazzo
Predictable execution models have been proposed over the years to achieve contention-free execution of real-time tasks by preloading data into dedicated local memories. In this way, memory access delays can be hidden by delegating a DMA engine to perform memory transfers in parallel with processor execution. Nevertheless, state-of-the-art protocols introduce additional blocking due to priority inversion, which may severely penalize latency-sensitive applications and even worsen the system schedulability with respect to the use of classical scheduling schemes. This paper proposes a new protocol that allows hiding memory transfer delays while reducing priority inversion, thus favoring the schedulability of latency-sensitive tasks. The corresponding analysis is formulated as an optimization problem. Experimental results show the advantages of the proposed protocol against state-of-the-art solutions.
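A toy two-stage timeline model illustrates why overlapping DMA transfers with execution hides memory delays: while the CPU executes task i from local memory, the DMA already loads task i+1. The model below assumes back-to-back DMA transfers and unlimited local buffering; it is not the protocol or the schedulability analysis proposed in the paper.

```python
# Toy timeline model of memory/compute overlap under a predictable execution
# model (illustrative assumptions only: single DMA, unlimited local buffering).
def sequential_makespan(tasks):
    """No overlap: every task first loads its data, then executes."""
    return sum(mem + cpu for mem, cpu in tasks)

def pipelined_makespan(tasks):
    """DMA loads task i+1 while the CPU executes task i."""
    dma_done = 0.0   # time at which the DMA finished its latest transfer
    cpu_done = 0.0   # time at which the CPU finished its latest task
    for mem, cpu in tasks:
        dma_done += mem                            # transfers run back to back
        cpu_done = max(cpu_done, dma_done) + cpu   # compute waits for its data
    return cpu_done

tasks = [(2.0, 5.0), (3.0, 4.0), (1.0, 6.0)]       # (memory time, compute time)
print(sequential_makespan(tasks))                  # 21.0
print(pipelined_makespan(tasks))                   # 17.0 -> memory delay largely hidden
```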
Building End-to-End IoT Applications with QoS Guarantees
Pub Date: 2020-07-01 | DOI: 10.1109/DAC18072.2020.9218564
A. Hamann, Selma Saidi, David Ginthoer, C. Wietfeld, D. Ziegenbein
Many industrial players are currently challenged in building distributed CPS and IoT applications with stringent end-to-end QoS requirements. Examples are Vehicle-to-X applications, Advanced Driver-Assistance Systems (ADAS), and functionalities in the Industrial Internet of Things (IIoT). Currently, there is no comprehensive solution that allows such distributed applications to be efficiently programmed, deployed, and operated. This paper focuses on real-time concerns in building distributed CPS and IoT systems. The focus lies, on the one hand, on mechanisms required inside the IoT (compute) nodes and, on the other hand, on the communication protocols, such as TSN and 5G, that connect them. In the authors’ view, the required building blocks for a first end-to-end technology stack are available. However, their integration into a holistic framework is missing.
Dynamic Information Flow Tracking for Embedded Binaries using SystemC-based Virtual Prototypes
Pub Date: 2020-07-01 | DOI: 10.1109/DAC18072.2020.9218494
Pascal Pieper, V. Herdt, Daniel Große, R. Drechsler
Avoiding security vulnerabilities is very important for embedded systems. Dynamic Information Flow Tracking (DIFT) is a powerful technique to analyze software with respect to security policies in order to protect the system against a broad range of security-related exploits. However, existing DIFT approaches are either unavailable for Virtual Prototypes (VPs) or fail to model complex hardware/software interactions. In this paper, we present a novel approach that enables early and accurate DIFT of binaries targeting embedded systems with custom peripherals. Leveraging the SystemC framework, our DIFT engine tracks accurate data flow information alongside the program execution to detect violations of security policies at run-time. We demonstrate the effectiveness and applicability of our approach through extensive experiments.
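As a conceptual illustration of the DIFT mechanism (not the SystemC-based engine itself), the toy register machine below attaches a taint bit to data arriving from a peripheral, propagates it through ALU operations, and flags a policy violation when tainted data reaches a protected sink. The register names and the policy are illustrative assumptions.

```python
# Conceptual DIFT sketch on a toy register machine: a taint bit follows data
# from an untrusted source; using tainted data at a protected sink violates the
# policy. Illustration of the general mechanism only.
class ToyDIFT:
    def __init__(self):
        self.regs = {}
        self.taint = {}

    def load_input(self, reg, value):          # data arriving from a peripheral
        self.regs[reg] = value
        self.taint[reg] = True                 # mark untrusted data as tainted

    def load_const(self, reg, value):
        self.regs[reg] = value
        self.taint[reg] = False

    def alu(self, dst, src1, src2):
        self.regs[dst] = self.regs[src1] + self.regs[src2]
        self.taint[dst] = self.taint[src1] or self.taint[src2]   # taint union

    def store_protected(self, reg):
        if self.taint[reg]:
            raise RuntimeError("policy violation: tainted data reaches protected sink")

dift = ToyDIFT()
dift.load_input("r0", 0x41)     # attacker-controlled byte
dift.load_const("r1", 1)
dift.alu("r2", "r0", "r1")      # taint propagates through the ALU
try:
    dift.store_protected("r2")  # tainted data reaches the sink
except RuntimeError as err:
    print(err)                  # -> policy violation reported at run-time
```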
A History-Based Auto-Tuning Framework for Fast and High-Performance DNN Design on GPU
Pub Date: 2020-07-01 | DOI: 10.1109/DAC18072.2020.9218700
Jiandong Mu, Mengdi Wang, Lanbo Li, Jun Yang, Wei Lin, Wei Zhang
While Deep Neural Networks (DNNs) are becoming increasingly popular, there is a growing trend to accelerate DNN applications on hardware platforms such as GPUs and FPGAs to gain higher performance and efficiency. However, tuning the performance for such platforms is time-consuming due to the large design space and the high cost of evaluating each design point. Although many tuning algorithms, such as the XGBoost tuner and the genetic algorithm (GA) tuner, have been proposed in previous work to guide the design-space exploration process, tuning time remains a critical problem. In this work, we propose a novel auto-tuning framework to optimize DNN operator design on GPU by efficiently leveraging the tuning history in different scenarios. Our experiments show that we achieve superior performance over state-of-the-art work, such as the auto-tuning framework TVM and the hand-optimized library cuDNN, while reducing the search time by 8.96x and 4.58x compared with the XGBoost tuner and the GA tuner in TVM.
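The sketch below illustrates the history-reuse idea in isolation, under assumed, hypothetical helper names: replay the best configurations recorded for previously tuned operators, then refine locally around the best replayed one. It is not the framework's actual API or TVM code.

```python
# Illustrative sketch of history reuse in auto-tuning (not the framework's API):
# replay the best recorded configurations for similar operators, then refine
# locally around the best replayed configuration.
import random

def tune_with_history(evaluate, history, neighbors, budget=20, top_k=5):
    """evaluate(cfg) -> latency; history: list of (cfg, old_latency) records."""
    seeds = [cfg for cfg, _ in sorted(history, key=lambda r: r[1])[:top_k]]
    best_cfg, best_lat = None, float("inf")
    for cfg in seeds:                          # replay promising historical configs
        lat = evaluate(cfg)
        if lat < best_lat:
            best_cfg, best_lat = cfg, lat
    for _ in range(budget - len(seeds)):       # local refinement around the best seed
        cand = random.choice(neighbors(best_cfg))
        lat = evaluate(cand)
        if lat < best_lat:
            best_cfg, best_lat = cand, lat
    return best_cfg, best_lat

# toy usage: configurations are tile sizes; "latency" is a made-up convex cost
evaluate = lambda t: (t - 48) ** 2 + 10
history = [(32, 300.0), (64, 280.0), (16, 900.0)]
neighbors = lambda t: [max(1, t - 8), t + 8]
random.seed(0)
print(tune_with_history(evaluate, history, neighbors))
```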
Dadu-CD: Fast and Efficient Processing-in-Memory Accelerator for Collision Detection
Pub Date: 2020-07-01 | DOI: 10.1109/DAC18072.2020.9218709
Yuxin Yang, Xiaoming Chen, Yinhe Han
Collision detection is a fundamental task in motion planning for robotics. Typically, collision detection is the performance bottleneck of an entire motion-planning pipeline, as well as the dominant source of its energy consumption. Several hardware accelerators have been proposed for collision detection, which achieve higher performance and energy efficiency than general-purpose CPUs and GPUs. However, existing accelerators still face a memory-bandwidth bottleneck, due to the large data volume required by the parallel processing cores and the limited DRAM bandwidth. In this work, we propose a novel collision detection accelerator that employs the processing-in-memory technique. We elaborate the in-memory processing architecture to fully utilize the internal bandwidth of DRAM banks. To make the algorithm and hardware highly efficient for in-memory processing, a set of innovative software and hardware techniques is also proposed. Compared with a state-of-the-art ASIC-based collision detection accelerator, both the performance and the energy efficiency of our accelerator are significantly improved.
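One conceptual way collision checks map onto in-memory bitwise operations (an assumption for illustration, not necessarily this accelerator's dataflow) is to voxelize robot poses and obstacles into occupancy bitmaps, so that a check reduces to a wide AND followed by an OR-reduction, which bank-level processing-in-memory can execute in parallel.

```python
# Conceptual mapping only (not the accelerator's dataflow): with robot poses and
# obstacles voxelized into occupancy bitmaps, a collision check becomes a wide
# bitwise AND plus an OR-reduction -- the kind of operation that DRAM-bank-level
# processing-in-memory can run in parallel.
import numpy as np

def in_collision(pose_bitmap, obstacle_bitmap):
    """True if any voxel is occupied by both the robot pose and an obstacle."""
    return bool(np.any(pose_bitmap & obstacle_bitmap))

# toy usage: 16x16x16 voxel grid packed as a boolean array
grid = (16, 16, 16)
obstacles = np.zeros(grid, dtype=bool)
obstacles[8:12, 8:12, :] = True            # a box-shaped obstacle
pose_free = np.zeros(grid, dtype=bool)
pose_free[0:4, 0:4, :] = True              # robot far from the obstacle
pose_hit = np.zeros(grid, dtype=bool)
pose_hit[10:14, 10:14, :] = True           # robot overlapping the obstacle
print(in_collision(pose_free, obstacles))  # False
print(in_collision(pose_hit, obstacles))   # True
```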