Pub Date: 2020-07-01 | DOI: 10.1109/DAC18072.2020.9218650
Dongyeob Shin, Geonho Kim, Joongho Jo, Jongsun Park
In deep neural network (DNN) training, network weights are iteratively updated with the weight gradients obtained from stochastic gradient descent (SGD). Since SGD inherently tolerates noisy gradient calculations, approximating weight gradient computations has large potential for training energy/time savings without degrading accuracy. In this paper, we propose an input-dependent approximation of the weight gradient to improve the energy efficiency of the training process. Since the network's output prediction confidence changes with the training input, the relation between confidence and weight-gradient magnitude can be efficiently exploited to skip gradient computations without an accuracy drop, especially for high-confidence inputs. Under a given squared-error constraint, the computation skip rate can also be controlled by changing the confidence threshold. Simulation results show that our approach skips 72.6% of gradient computations for the CIFAR-100 dataset on ResNet-18 without accuracy degradation. A hardware implementation in a 65nm CMOS process shows that our design achieves maximum per-epoch training energy and time savings of 88.84% and 98.16%, respectively, for CIFAR-100 on ResNet-18 compared to a state-of-the-art training accelerator.
Title: Prediction Confidence based Low Complexity Gradient Computation for Accelerating DNN Training
Published in: 2020 57th ACM/IEEE Design Automation Conference (DAC)
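The confidence-threshold skip rule described in this abstract can be sketched in a few lines. This is a toy Python illustration, not the authors' implementation: the softmax-of-logits confidence measure and the 0.9 threshold are assumptions for the example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def should_skip_gradient(logits, confidence_threshold=0.9):
    """Skip the weight-gradient computation for an input the network
    already classifies with high confidence: its gradient tends to be
    small, and SGD tolerates the resulting approximation noise."""
    return max(softmax(logits)) >= confidence_threshold

def train_step(batch_logits, confidence_threshold=0.9):
    """Return indices of batch samples whose gradient computation is kept."""
    return [i for i, logits in enumerate(batch_logits)
            if not should_skip_gradient(logits, confidence_threshold)]
```

Raising the threshold keeps more gradient computations (lower skip rate, tighter error); lowering it trades accuracy headroom for energy, which is the knob the squared-error constraint controls in the paper.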
Pub Date: 2020-07-01 | DOI: 10.1109/DAC18072.2020.9218644
Maohua Zhu, Yuan Xie
Neural networks (NNs) exhibit high redundancy in their parameters, so pruning methods can achieve high compression ratios without accuracy loss. However, the very high sparsity produced by unstructured pruning methods is difficult to map efficiently onto Graphics Processing Units (GPUs) because of decoding overhead and workload imbalance. With the introduction of the Tensor Core, the latest GPUs achieve even higher throughput for dense neural networks. As a result, unstructured sparse networks fail to outperform their dense counterparts because they are not currently supported by the Tensor Core. To tackle this problem, prior work suggests structured pruning to improve the performance of sparse NNs on GPUs. However, such structured pruning methods must sacrifice a significant part of the sparsity to retain model accuracy, which limits the speedup on the hardware. In this paper, we observe that the Tensor Core is also able to compute unstructured sparse NNs efficiently. To achieve this goal, we first propose ExTensor, a set of sparse Tensor Core instructions with a variable input matrix tile size. The variable tile size allows a matrix multiplication to be implemented by mixing different types of ExTensor instructions. We build a performance model to estimate the latency of an ExTensor instruction given an operand sparse weight matrix. Based on this model, we propose a heuristic algorithm that finds the optimal sequence of instructions for an ExTensor-based kernel to achieve the best performance on the GPU. Experimental results demonstrate that our approach achieves 36% better performance than the state-of-the-art sparse Tensor Core design.
Title: Taming Unstructured Sparsity on GPUs via Latency-Aware Optimization
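The latency-model-guided selection this abstract describes can be illustrated with a toy greedy chooser. Everything here is a simplified assumption, not the ExTensor design: the two instruction kinds, the cycle constants in `modeled_latency`, and the per-tile greedy choice stand in for the paper's instruction set, calibrated model, and heuristic search.

```python
def modeled_latency(tile, kind):
    """Hypothetical latency model: a dense instruction costs a fixed
    number of cycles per tile, while a sparse instruction pays a small
    decoding overhead plus a per-nonzero cost."""
    rows, cols = len(tile), len(tile[0])
    if kind == "dense":
        return rows * cols            # fixed cost, independent of sparsity
    nnz = sum(1 for row in tile for v in row if v != 0)
    return 4 + 2 * nnz                # decode overhead + per-nonzero cost

def pick_instructions(tiles):
    """Greedy analogue of the paper's heuristic: for each weight tile,
    select the instruction variant the model predicts to be cheaper."""
    return [min(("dense", "sparse"),
                key=lambda k: modeled_latency(tile, k))
            for tile in tiles]
```

The point of the sketch: once a latency model exists per instruction and tile, choosing an instruction mix reduces to an optimization over the tiling of the weight matrix.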
Pub Date: 2020-07-01 | DOI: 10.1109/DAC18072.2020.9218612
Mohammad A. Alshboul, James Tuck, Yan Solihin
Future systems are expected to increasingly include a Non-Volatile Main Memory (NVMM). However, due to NVMM's limited write endurance, the number of writes must be reduced. While new architectures and algorithms have been proposed to reduce writes to NVMM, few studies have looked at the effect of compiler optimizations on writes. In this paper, we investigate the impact of one popular compiler optimization (loop tiling) on a very important computation kernel (matrix multiplication). Our novel observation is that tiling of matrix multiplication causes a 25× write amplification. Furthermore, we investigate techniques to make tiling more NVMM-friendly by choosing the right tile size and employing hierarchical tiling. Our method, Write-Efficient Tiling (WET), adds a new outer tile designed to fit the write working set in the Last-Level Cache (LLC), reducing the number of writes to NVMM. Our experiments reduce writes by 81% while simultaneously improving performance.
Title: WET: Write Efficient Loop Tiling for Non-Volatile Main Memory
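Why k-tiling amplifies writes, and why an outer output tile fixes it, can be captured in a toy counting model. This is a back-of-the-envelope sketch under stated assumptions (every partial C tile is written back after each k-tile unless an outer tile pins it in the LLC), not WET's actual cache simulation.

```python
def write_amplification(n, tile_k, output_stationary):
    """Main-memory writes per C element for an n×n matmul tiled along k
    with tile size tile_k.

    Toy model: without an outer output tile, each C element is updated
    and written back once per k-tile (n // tile_k times). With a WET-style
    outer tile sized to keep the write working set in the LLC, each C
    element is written back to NVMM exactly once."""
    if output_stationary:
        return 1
    return n // tile_k

def total_c_writes(n, tile_k, output_stationary):
    """Total writes to main memory for the n×n output matrix C."""
    return n * n * write_amplification(n, tile_k, output_stationary)
```

For example, n = 512 with tile_k = 32 gives a 16× write amplification in this model, the kind of blow-up the paper observes (25× in their measured kernel) and that the added outer tile collapses back to 1×.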
Pub Date: 2020-07-01 | DOI: 10.1109/DAC18072.2020.9218528
Dongup Kwon, Suyeon Hur, Hamin Jang, E. Nurvitadhi, Jangwoo Kim
The increasing size of recurrent neural networks (RNNs) makes it hard to meet the growing demand for real-time AI services. For low-latency RNN serving, FPGA-based accelerators can leverage specialized architectures with optimized dataflow. However, they also suffer from severe hardware under-utilization when partitioning RNNs, and thus fail to achieve scalable performance. In this paper, we identify the performance bottlenecks of existing RNN partitioning strategies. We then propose a novel RNN partitioning strategy to achieve scalable multi-FPGA acceleration of large RNNs. First, we introduce three parallelism levels and exploit them by partitioning weight matrices, matrix/vector operations, and layers. Second, we examine the performance impact of collective communications and software pipelining to derive more accurate and optimal distribution results. We prototyped an FPGA-based acceleration system using multiple Intel high-end FPGAs; our partitioning scheme allows up to 2.4× faster inference of modern RNN workloads than conventional partitioning methods.
Title: Scalable Multi-FPGA Acceleration for Large RNNs with Full Parallelism Levels
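The lowest of the three parallelism levels, partitioning a weight matrix across devices, can be sketched as a row split of a matrix-vector product. This is a minimal illustration of the idea, not the paper's partitioner: the row-wise split and the concatenation step (an all-gather in a real multi-FPGA system) are the assumptions here.

```python
def partition_rows(matrix, n_parts):
    """Split a weight matrix row-wise so each accelerator produces a
    disjoint slice of the output vector; no reduction is needed, only
    a gather of the partial outputs."""
    chunk = (len(matrix) + n_parts - 1) // n_parts
    return [matrix[i * chunk:(i + 1) * chunk] for i in range(n_parts)]

def matvec(matrix, vec):
    """Plain matrix-vector product (one accelerator's local work)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def distributed_matvec(matrix, vec, n_parts):
    """Each partition computes its slice independently; concatenating
    the slices reconstructs the full output vector."""
    out = []
    for part in partition_rows(matrix, n_parts):
        out.extend(matvec(part, vec))
    return out
```

The paper's contribution is deciding how much of each level (matrix split, operation split, layer split) to use, given that the gather above becomes a real collective communication whose cost must enter the distribution decision.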
Pub Date: 2020-07-01 | DOI: 10.1109/DAC18072.2020.9218741
Maedeh Hemmat, Joshua San Miguel, A. Davoodi
We propose CAP’NN, a framework for Class-Aware Personalized Neural Network Inference. CAP’NN prunes an already-trained neural network model based on the preferences of individual users. Specifically, by adapting to the subset of output classes that each user is expected to encounter, CAP’NN is able to prune not only ineffectual neurons but also miseffectual neurons that confuse classification, without the need to retrain the network. CAP’NN achieves up to 50% model size reduction while actually improving the top-1 (top-5) classification accuracy by up to 2.3% (3.2%) when the user only encounters a subset of the VGG-16 classes.
Title: CAP’NN: Class-Aware Personalized Neural Network Inference
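The class-aware pruning idea can be sketched on the output layer alone. This toy Python sketch is not the CAP’NN algorithm (which also identifies miseffectual neurons deeper in the network); the magnitude threshold `eps` and the weight layout are assumptions for illustration.

```python
def class_aware_prune(output_weights, user_classes, eps=1e-3):
    """output_weights[c][h] is the weight from hidden neuron h to class c.

    Keep only the rows for the user's classes, then drop hidden neurons
    whose remaining outgoing weights are all negligible: they are
    ineffectual for this particular user, so no retraining is needed."""
    kept_rows = [output_weights[c] for c in sorted(user_classes)]
    n_hidden = len(kept_rows[0])
    kept_neurons = [h for h in range(n_hidden)
                    if any(abs(row[h]) > eps for row in kept_rows)]
    pruned = [[row[h] for h in kept_neurons] for row in kept_rows]
    return pruned, kept_neurons
```

A neuron that only ever fed classes outside the user's subset disappears entirely, which is where the per-user model-size reduction comes from.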
Pub Date: 2020-07-01 | DOI: 10.1109/DAC18072.2020.9218579
Charles Gouert, N. G. Tsoutsos
As cloud computing becomes increasingly ubiquitous, protecting the confidentiality of data outsourced to third parties becomes a priority. While encryption is a natural solution to this problem, traditional algorithms only protect data at rest and in transit; they do not support encrypted processing. In this work, we introduce ROMEO, which enables easy-to-use privacy-preserving processing of data in the cloud using homomorphic encryption. ROMEO automatically converts arbitrary programs expressed in Verilog HDL into equivalent homomorphic circuits that are evaluated using encrypted inputs. For our experiments, we employ cryptographic circuits, such as AES, and benchmarks from the ISCAS’85 and ISCAS’89 suites.
Title: Romeo: Conversion and Evaluation of HDL Designs in the Encrypted Domain
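The conversion target of a flow like ROMEO is a gate-level netlist evaluated one gate at a time. The sketch below evaluates such a netlist on plaintext bits; in the encrypted domain, the same traversal would apply a homomorphic gate evaluation (e.g., a TFHE-style bootstrapped gate, our illustrative assumption, since the abstract does not name the scheme) to ciphertext wires instead.

```python
# Each gate: (output_wire, op, (input_wire_a, input_wire_b)).
GATE_OPS = {
    "AND": lambda a, b: a & b,
    "XOR": lambda a, b: a ^ b,
    "OR":  lambda a, b: a | b,
}

def evaluate_netlist(gates, inputs):
    """Evaluate a combinational netlist in topological order.
    `inputs` maps primary-input wire names to bit values; the result
    maps every wire (inputs and gate outputs) to its value."""
    wires = dict(inputs)
    for out, op, (a, b) in gates:
        wires[out] = GATE_OPS[op](wires[a], wires[b])
    return wires
```

For example, a half adder synthesized from Verilog reduces to one XOR and one AND gate, and the evaluator computes its sum and carry wires.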
Pub Date: 2020-07-01 | DOI: 10.1109/DAC18072.2020.9218661
Andreas Kurth, Samuel Riedel, Florian Zaruba, T. Hoefler, L. Benini
Atomic operations are crucial for most modern parallel and concurrent algorithms, which necessitates their optimized implementation in highly scalable manycore processors. We propose a modular, efficient, open-source ATomic UNit (ATUN) architecture that can be placed flexibly at different levels of the memory hierarchy. ATUN demonstrates near-optimal linear scaling for various synthetic and real-world workloads on an FPGA prototype with 32 RISC-V cores. We characterize the hardware complexity of our ATUN design in 22 nm FDSOI and find that it scales linearly in area (only 0.5 kGE per core) and logarithmically in the critical path.
Title: ATUNs: Modular and Scalable Support for Atomic Operations in a Shared Memory Multiprocessor
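The semantics an atomic unit must guarantee, an indivisible read-modify-write, can be shown with a software analogue. This is a conceptual sketch of fetch-and-add in Python (a lock stands in for the hardware's indivisibility), not a model of the ATUN microarchitecture.

```python
import threading

class AtomicCounter:
    """Software analogue of hardware fetch-and-add: the read-modify-write
    is indivisible, so concurrent increments are never lost."""
    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def fetch_add(self, amount=1):
        """Atomically add `amount` and return the previous value."""
        with self._lock:
            old = self._value
            self._value += amount
            return old

def hammer(counter, n_threads=8, per_thread=1000):
    """Stress the counter from many threads; without atomicity some
    increments would be lost to racing read-modify-write sequences."""
    def worker():
        for _ in range(per_thread):
            counter.fetch_add()
    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Placing this operation in hardware near the memory (as ATUN does, at a chosen level of the hierarchy) removes the round trips that a lock-based software version would incur.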
Pub Date: 2020-07-01 | DOI: 10.1109/DAC18072.2020.9218747
Siva Satyendra Sahoo, B. Veeravalli, Akash Kumar
Cross-layer reliability (CLR) presents a cost-effective alternative to traditional single-layer design in resource-constrained embedded systems. CLR provides the scope for leveraging the inherent fault-masking of multiple layers and exploiting application-specific tolerances to degradation in some Quality of Service (QoS) metrics. However, it can also lead to an explosion in design complexity, and state-of-the-art approaches to such joint optimization across multiple degrees of freedom can degrade the system-level Design Space Exploration (DSE) results. To this end, we propose a DSE methodology for enabling CLR-aware task-mapping in heterogeneous embedded systems. Specifically, we present novel approaches to both task-level and system-level analysis for performing an early-stage exploration of various design decisions. The proposed methodology yields considerable improvements over other state-of-the-art approaches and scales well with application size.
Title: CL(R)Early: An Early-stage DSE Methodology for Cross-Layer Reliability-aware Heterogeneous Embedded Systems
Pub Date: 2020-07-01 | DOI: 10.1109/DAC18072.2020.9218682
Wenjie Fu, Leilei Jin, Ming Ling, Yu Zheng, Longxing Shi
Wide supply voltage scaling is critical to enable worthwhile dynamic adjustment of processor efficiency against varying workloads. In this paper, a cross-layer power and timing evaluation method is proposed to estimate processor energy efficiency using both circuit and architectural information over a wide voltage range. Process variations are considered through statistical static timing analysis, while the voltage effect is modeled through secondary iterated fittings. The error in estimating processor energy efficiency decreases to 8.29% when the supply voltage is scaled from 1.1 V to 0.6 V, while traditional architectural evaluations exhibit errors of more than 40%.
Title: A Cross-Layer Power and Timing Evaluation Method for Wide Voltage Scaling
Pub Date: 2020-07-01 | DOI: 10.1109/DAC18072.2020.9218627
Chang Meng, Weikang Qian, A. Mishchenko
Approximate computing is an emerging design technique for error-resilient applications. It improves circuit area, power, and delay at the cost of introducing some errors. Approximate logic synthesis (ALS) is an automatic process for producing approximate circuits. This paper proposes approximate resubstitution with an approximate care set and uses it to build a simulation-based ALS flow. Experimental results demonstrate that the proposed method saves 7%–18% area compared to state-of-the-art methods. The ALSRAC code is open-source.
Title: ALSRAC: Approximate Logic Synthesis by Resubstitution with Approximate Care Set
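The core move of simulation-based resubstitution with an approximate care set can be sketched over bit signatures. This is a toy illustration under stated assumptions, not the ALSRAC algorithm: signatures are per-pattern simulation bits, and the first acceptable divisor wins (a real flow would also weigh area savings and a global error metric).

```python
def resubstitute(target_sig, care_sig, divisor_sigs, max_errors=0):
    """Try to replace a node with an existing divisor node.

    target_sig[i]  : target node's simulated output on pattern i
    care_sig[i]    : 1 if pattern i is in the (approximate) care set
    divisor_sigs   : {divisor_name: signature} candidates

    A divisor qualifies if it disagrees with the target on at most
    max_errors care-set patterns; don't-care patterns never count,
    which is what makes the care-set approximation pay off."""
    for name, sig in divisor_sigs.items():
        errors = sum(1 for t, c, d in zip(target_sig, care_sig, sig)
                     if c and t != d)
        if errors <= max_errors:
            return name
    return None
```

Shrinking the care set (or raising `max_errors`) admits more replacements and thus more area savings, at the cost of circuit error, which is the trade-off the ALS flow manages.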