Towards warp-scheduler friendly STT-RAM/SRAM hybrid GPGPU register file design
Quan Deng, Youtao Zhang, Minxuan Zhang, Jun Yang
2017 IEEE/ACM International Conference on Computer-Aided Design (ICCAD) | Pub Date: 2017-11-13 | DOI: 10.1109/ICCAD.2017.8203850
Modern Graphics Processing Units (GPUs) widely adopt large SRAM-based register files (RFs) to enable fast context switching. A large SRAM RF may consume 20% to 40% of GPU power, which has become one of the major design challenges for GPUs. Recent studies mitigate the issue through hybrid RF designs that architect a large STT-RAM (Spin-Transfer Torque magnetic RAM) RF with a small SRAM buffer. However, the long STT-RAM write latency throttles data exchange between STT-RAM and SRAM, which penalizes warp schedulers that switch contexts frequently, e.g., round-robin schedulers. In this paper, we propose HC-RF, a warp-scheduler-friendly hybrid RF design built on a novel SRAM/STT-RAM hybrid-cell (HC) structure. HC-RF exploits cell-level integration to improve the effective bandwidth between STT-RAM and SRAM. By enabling silent data transfer from SRAM to STT-RAM without blocking RF banks, HC-RF supports concurrent context switching and decouples performance from the choice of warp scheduler. Our experimental results show that, on average, HC-RF achieves 50% performance improvement and 44% energy reduction over the coarse-grained hybrid design when adopting the Loose Round Robin (LRR) warp scheduler.
SAT-based compilation to a non-von Neumann processor
S. Chaudhuri, A. Hetzel
Pub Date: 2017-11-13 | DOI: 10.1109/ICCAD.2017.8203842
This paper describes a compilation technique that accelerates dataflow computations, common in deep neural network computing, on Coarse-Grained Reconfigurable Array (CGRA) architectures. The technique has been demonstrated to automatically compile dataflow programs onto a commercial massively parallel CGRA-based dataflow processor (DPU) containing 16,000 processing elements. The DPU architecture overcomes the von Neumann bottleneck by spatially flowing and reusing data from local memories, and provides higher computation efficiency than temporal parallel architectures such as GPUs and multi-core CPUs. However, existing software development tools for CGRAs are limited to compiling domain-specific programs onto processing elements with uniform structures, and are not effective on complex microarchitectures where memory access latencies vary in a nontrivial fashion depending on data locality. A primary contribution of this paper is a general algorithm that can compile general dataflow graphs and efficiently utilize processing elements with rich microarchitectural features such as complex instructions, multi-precision datapaths, local memories, register files, and switches. Another contribution is a novel application of Boolean satisfiability (SAT) to formally solve this complex, irregular optimization problem and produce results comparable in quality to hand-written assembly code produced by human experts. A third contribution is an adaptive windowing algorithm that tames the complexity of the SAT-based approach, delivering a scalable and robust solution.
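The abstract does not disclose the actual SAT encoding, but the core idea of casting "assign each dataflow operation to a processing element" as Boolean satisfiability can be illustrated at toy scale. The sketch below is an assumption-laden illustration, not the paper's encoding: variables, clause structure, and the brute-force solver are all invented for clarity (a real flow would feed the CNF to an industrial SAT solver and add latency and locality constraints).

```python
from itertools import product

def var(op, pe, n_pes):
    """Boolean variable index (1-based) for 'op is mapped to pe'."""
    return op * n_pes + pe + 1

def encode(n_ops, n_pes):
    """CNF: every op gets >=1 PE, <=1 PE, and no PE hosts two ops."""
    clauses = []
    for o in range(n_ops):
        clauses.append([var(o, p, n_pes) for p in range(n_pes)])        # at least one PE
        for p in range(n_pes):
            for q in range(p + 1, n_pes):
                clauses.append([-var(o, p, n_pes), -var(o, q, n_pes)])  # at most one PE
    for p in range(n_pes):
        for o in range(n_ops):
            for u in range(o + 1, n_ops):
                clauses.append([-var(o, p, n_pes), -var(u, p, n_pes)])  # PE not shared
    return clauses

def brute_force_sat(clauses, n_vars):
    """Enumerate assignments (fine at toy scale; stands in for a SAT solver)."""
    for bits in product([False, True], repeat=n_vars):
        if all(any(bits[abs(l) - 1] == (l > 0) for l in c) for c in clauses):
            return bits
    return None

n_ops, n_pes = 3, 4
model = brute_force_sat(encode(n_ops, n_pes), n_ops * n_pes)
mapping = {o: p for o in range(n_ops) for p in range(n_pes)
           if model[var(o, p, n_pes) - 1]}
```

The appeal of the SAT formulation is that irregular resource constraints (non-uniform PEs, memory latencies) become just more clauses, rather than special cases in a heuristic mapper.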
Dedicated synthesis for MZI-based optical circuits based on AND-inverter graphs
Arighna Deb, R. Wille, R. Drechsler
Pub Date: 2017-11-13 | DOI: 10.1109/ICCAD.2017.8203783
Optical circuits have received significant interest as a promising alternative to existing electronic systems. Consequently, the synthesis of optical circuits is also receiving increasing attention. However, initial solutions for the synthesis of optical circuits either rely on manual design or on rather straightforward mappings from established data structures such as BDDs and SoPs/ESoPs to the corresponding optical netlist. These approaches hardly utilize the full potential of the gate libraries available in this domain. In this paper, we propose an alternative synthesis solution based on AND-Inverter Graphs (AIGs) which is capable of utilizing this potential. That is, we present a scheme that directly maps the given function representation to the desired circuit in a one-to-one fashion, yielding significantly smaller circuits. Experimental evaluations confirm that the proposed solution generates optical circuits with up to 97% fewer gates than existing synthesis approaches.
ApproxLUT: A novel approximate lookup table-based accelerator
Ye Tian, Ting Wang, Qian Zhang, Q. Xu
Pub Date: 2017-11-13 | DOI: 10.1109/ICCAD.2017.8203810
Computing with memory, which stores the function responses of selected input patterns into lookup tables offline and retrieves those values when similar patterns are encountered (instead of computing online), is a promising energy-efficient computing technique. For a given lookup table size, the efficiency of this technique depends on which function responses are stored and how they are organized. In this paper, we propose ApproxLUT, a novel adaptive approximate lookup-table-based accelerator that stores function responses in a hierarchy of increasingly fine granularity and accuracy. In addition, the proposed accelerator provides lightweight compensation of output results at different precision levels according to input patterns and output quality requirements. Moreover, our accelerator performs adaptive lookup table search by exploiting input locality. Experimental results on various computation kernels show significant energy savings over prior solutions.
State retention for power gated design with non-uniform multi-bit retention latches
Guo-Gin Fan, Mark Po-Hung Lin
Pub Date: 2017-11-13 | DOI: 10.1109/ICCAD.2017.8203833
Retention registers/latches are commonly applied to power-gated circuits for state retention during sleep mode. Recent studies have shown that uniform multi-bit retention registers (MBRRs) can reduce storage size, and hence save more chip area and leakage power, compared with single-bit retention registers. In this paper, we formulate a new problem of power-gated circuit optimization with non-uniform MBRRs to achieve even greater storage savings and higher storage utilization. An ILP-based approach is proposed to effectively explore different combinations of non-uniform MBRR replacement. Experimental results show that the proposed approach reduces storage size by 36% compared with state-of-the-art uniform MBRR replacement, while achieving 100% storage utilization.
Approximate image storage with multi-level cell STT-MRAM main memory
Hengyu Zhao, Linuo Xue, Ping Chi, Jishen Zhao
Pub Date: 2017-11-13 | DOI: 10.1109/ICCAD.2017.8203788
Images consume significant storage space in both consumer devices and the cloud. As such, image processing applications incur high energy consumption when loading and accessing image data in memory. Fortunately, most image processing applications can tolerate approximate image data storage. In addition, multi-level cell spin-transfer torque MRAM (MLC STT-MRAM) offers unique design opportunities as image memory: the two bits in a memory cell require asymmetric write currents, with the soft bit requiring much less write current than the hard bit. This paper proposes an approximate image storage scheme that improves system energy efficiency without violating applications' image quality requirements. Our design consists of (i) an approximate image storage mechanism that strives to write only the soft bits in MLC STT-MRAM main memory, using small write currents, and (ii) a memory mode controller that determines how image data is approximated and coordinates across precise/approximate memory access modes. Our experimental results with various image processing workloads demonstrate that our design reduces memory access energy consumption by 53% and 2.3x, with 100% user satisfaction, compared with traditional DRAM-based and MLC phase-change-memory-based main memory, respectively.
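The soft-bit-only write can be modelled in a few lines: an approximate write updates only the soft bit of each two-bit cell, so the hard bits keep their stale values, and a mode controller falls back to a precise (full) write when the resulting pixel error would be too large. The bit-to-cell mapping and the error threshold below are assumptions for illustration, not the paper's controller policy.

```python
SOFT_POSITIONS = (0, 2, 4, 6)  # assumed: even bit positions land in soft bits

def approx_write(old, new):
    """Soft-bit-only write of an 8-bit pixel held in four 2-bit MLC cells:
    bits in SOFT_POSITIONS take the new value, hard bits stay stale."""
    out = 0
    for pos in range(8):
        src = new if pos in SOFT_POSITIONS else old
        out |= src & (1 << pos)
    return out

def write_pixel(old, new, max_err):
    """Mode controller sketch: use the cheap approximate write when the
    stored value stays within the quality budget, else write precisely."""
    cheap = approx_write(old, new)
    if abs(cheap - new) <= max_err:
        return cheap, "approximate"   # soft bits only, low write energy
    return new, "precise"             # both bit levels, full write energy
```

Because neighbouring pixels in natural images change slowly, most updates differ from the stale value only in low-order bits, so the cheap path is taken for the bulk of writes.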
FPGA placement and routing
Shih-Chun Chen, Yao-Wen Chang
Pub Date: 2017-11-13 | DOI: 10.1109/ICCAD.2017.8203878
FPGAs have emerged as a popular implementation platform for modern circuit designs, due mainly to their low non-recurring costs, in-field reprogrammability, and short turnaround time. A modern FPGA consists of an array of heterogeneous logic components, surrounded by routing resources and bounded by I/O cells. Compared to an ASIC, an FPGA has more limited logic and routing resources, more diverse architectures, and stricter design constraints; as a result, FPGA placement and routing problems become much more challenging. Furthermore, with growing complexity, diverse design objectives, high heterogeneity, and evolving technologies, modern FPGA placement and routing present many emerging research opportunities. In this paper, we introduce basic FPGA architectures, describe the FPGA placement and routing problems, and explain key techniques for solving them, including three major placement paradigms (partitioning, simulated annealing, and analytical placement), two routing paradigms (sequential and concurrent routing), and simultaneous placement and routing. Finally, we suggest future research directions for FPGA placement and routing.
AEP: An error-bearing neural network accelerator for energy efficiency and model protection
Lei Zhao, Youtao Zhang, Jun Yang
Pub Date: 2017-11-13 | DOI: 10.1109/ICCAD.2017.8203854
Neural Networks (NNs) have recently gained popularity in a wide range of modern application domains due to their superior inference accuracy. With growing problem size and complexity, modern NNs, e.g., Convolutional NNs (CNNs) and Deep NNs (DNNs), contain large numbers of weights, which require tremendous effort not only to prepare representative training datasets but also to train the network. There is thus an increasing demand to protect NN weight matrices, an emerging form of Intellectual Property (IP) in the NN field. Unfortunately, conventional encryption methods incur significant performance and energy overheads. In this paper, we propose AEP, a DianNao-based NN accelerator design for IP protection. AEP aggressively reduces DRAM timing to generate a device-dependent error mask, i.e., a set of erroneous cells whose distribution is device-dependent due to process variations. AEP incorporates the error mask into the NN training process so that the trained weights are device-dependent, which effectively defeats IP piracy, as exporting the weights to other devices cannot produce satisfactory inference accuracy. In addition, AEP speeds up NN inference and achieves significant energy reduction because main memory dominates the energy consumption of the DianNao accelerator. Our evaluation results show that by injecting 0.1% to 5% memory errors, AEP incurs negligible inference accuracy loss on the target device while exhibiting unacceptable accuracy degradation on other devices. In addition, AEP achieves an average of 72% performance improvement and 44% energy reduction over the DianNao baseline.
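The binding mechanism can be sketched abstractly: training bakes the target device's error mask into the weights, so reading those weights back through the same mask is a no-op, while reading them through a different device's mask corrupts them. Everything below is a stand-in model: a seeded RNG replaces real process variation, and "erroneous cells read as zero" is a simplifying assumption about the failure mode.

```python
import random

N = 2000          # number of weight storage cells (toy scale)
RATE = 0.05       # fraction of cells that fail under reduced DRAM timing

def error_mask(device_seed):
    """Device-dependent error mask. On real hardware the failing-cell
    pattern comes from process variation; here a seeded RNG stands in."""
    rng = random.Random(device_seed)
    return [rng.random() < RATE for _ in range(N)]

def apply_mask(weights, mask):
    """Reads from erroneous cells return 0.0 (a simplifying assumption)."""
    return [0.0 if bad else w for w, bad in zip(weights, mask)]

# "Training" on device A bakes A's mask into the weights: the weights the
# network converges to already reflect A's failing cells.
raw = [1.0] * N
mask_a, mask_b = error_mask("device-A"), error_mask("device-B")
trained = apply_mask(raw, mask_a)
```

Reading `trained` back on device A reproduces it exactly (the mask is idempotent on weights it shaped), whereas device B's mask zeroes a different, largely disjoint set of cells, which in a real network shows up as the unacceptable accuracy drop the paper reports.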
DAGSENS: Directed acyclic graph based direct and adjoint transient sensitivity analysis for event-driven objective functions
K. Aadithya, E. Keiter, Ting Mei
Pub Date: 2017-11-13 | DOI: 10.1109/ICCAD.2017.8203773
We present DAGSENS, a new approach to parametric transient sensitivity analysis of Differential Algebraic Equation (DAE) systems, such as SPICE-level circuits. The key ideas behind DAGSENS are: (1) representing the entire sequence of computations from the DAE parameters to the objective function (whose sensitivity is needed) as a Directed Acyclic Graph (DAG), called the "sensitivity DAG"; and (2) computing the required sensitivities efficiently by traversing the DAG with dynamic programming techniques. DAGSENS is simpler and easier to understand than previous approaches; for example, one can switch between direct and adjoint sensitivities simply by reversing the direction of DAG traversal. DAGSENS is also more powerful than previous approaches because it handles a more general class of objective functions, including those based on "events" that occur during a transient simulation (e.g., a node voltage crossing a threshold, a phase-locked loop (PLL) achieving lock, or a circuit signal reaching its maximum/minimum value). In this paper, we demonstrate DAGSENS on several electronic and biological applications, including high-speed communication, statistical cell library characterization, and gene expression.
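The direct-versus-adjoint duality by traversal direction can be shown on a scalar toy DAG: direct sensitivity is a forward sweep seeded at a parameter; adjoint sensitivity is a reverse sweep of the same DAG seeded at the objective. The add/mul node vocabulary and the example function are invented for illustration (DAGSENS itself builds its DAG from DAE transient computations, not scalar arithmetic).

```python
def evaluate(graph, order, inputs):
    """One forward pass: node values plus local partial derivatives."""
    val, part = dict(inputs), {}
    for n in order:
        op, a, b = graph[n]
        if op == "add":
            val[n] = val[a] + val[b]
            part[n] = {a: 1.0, b: 1.0}
        elif op == "mul":
            val[n] = val[a] * val[b]
            part[n] = {a: val[b], b: val[a]}
    return val, part

def direct(order, part, param):
    """Forward traversal: d(node)/d(param) for every node."""
    d = {param: 1.0}
    for n in order:
        d[n] = sum(p * d.get(src, 0.0) for src, p in part[n].items())
    return d

def adjoint(order, part, out):
    """Reverse traversal of the same DAG: d(out)/d(node) for every node."""
    dbar = {out: 1.0}
    for n in reversed(order):
        for src, p in part.get(n, {}).items():
            dbar[src] = dbar.get(src, 0.0) + p * dbar.get(n, 0.0)
    return dbar

# objective f = (x + y) * x, so df/dx = (x + y) + x, df/dy = x
graph = {"s": ("add", "x", "y"), "f": ("mul", "s", "x")}
order = ["s", "f"]                       # topological order
val, part = evaluate(graph, order, {"x": 3.0, "y": 2.0})
```

Direct mode costs one sweep per parameter; adjoint mode yields the sensitivity of one objective to all parameters in a single reverse sweep, which is why circuit tools pick the mode by the parameter-to-objective ratio.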
ORCHARD: Visual object recognition accelerator based on approximate in-memory processing
Yeseong Kim, M. Imani, T. Simunic
Pub Date: 2017-11-13 | DOI: 10.1109/ICCAD.2017.8203756
In recent years, machine learning for visual object recognition has been applied to various domains, e.g., autonomous vehicles, health diagnosis, and home automation. However, recognition procedures still consume substantial processing energy and incur a high data movement cost for memory accesses. In this paper, we propose a novel hardware accelerator design, called ORCHARD, which processes object recognition tasks inside memory. The proposed design accelerates both image feature extraction and a boosting-based learning algorithm, which are key subtasks of state-of-the-art image recognition approaches. We optimize the recognition procedures by leveraging approximate computing and emerging non-volatile memory (NVM) technology. NVM-based in-memory processing allows the proposed design to mitigate CMOS-based computation overhead, greatly improving system efficiency. In our evaluation, conducted with circuit- and device-level simulations, we show that ORCHARD successfully performs practical image recognition tasks, including text, face, pedestrian, and vehicle recognition, with only 0.3% accuracy loss from computation approximation. In addition, our design improves performance and energy efficiency by up to 376x and 1896x, respectively, compared to an existing processor-based implementation.