
Latest publications: 2020 57th ACM/IEEE Design Automation Conference (DAC)

Prediction Confidence based Low Complexity Gradient Computation for Accelerating DNN Training
Pub Date : 2020-07-01 DOI: 10.1109/DAC18072.2020.9218650
Dongyeob Shin, Geonho Kim, Joongho Jo, Jongsun Park
In deep neural network (DNN) training, network weights are iteratively updated with the weight gradients obtained from stochastic gradient descent (SGD). Since SGD inherently tolerates noisy gradient calculations, approximating weight gradient computations has large potential for training energy/time savings without degrading accuracy. In this paper, we propose an input-dependent approximation of the weight gradient for improving the energy efficiency of the training process. Considering that the output prediction of the network (confidence) changes with training inputs, the relation between the confidence and the magnitude of the weight gradient can be efficiently exploited to skip gradient computations without an accuracy drop, especially for high-confidence inputs. Under a given squared-error constraint, the computation skip rate can also be controlled by changing the confidence threshold. Simulation results show that our approach can skip 72.6% of gradient computations for the CIFAR-100 dataset using ResNet-18 without accuracy degradation. A hardware implementation in a 65nm CMOS process shows that our design achieves maximum per-epoch training energy and time savings of 88.84% and 98.16%, respectively, for the CIFAR-100 dataset using ResNet-18 compared to a state-of-the-art training accelerator.
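The confidence-gated skipping the abstract describes can be sketched in a few lines. This is our own illustration, not the authors' implementation; the function names and the 0.95 threshold are assumptions: compute each sample's softmax confidence and run the expensive weight-gradient computation only for samples below the threshold.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the class axis.
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def select_backward_samples(logits, threshold=0.95):
    """Indices of samples whose top-class confidence is below the threshold;
    only these need the full weight-gradient computation."""
    conf = softmax(logits).max(axis=1)
    return np.nonzero(conf < threshold)[0]

logits = np.array([[5.0, 0.1, 0.2],   # high confidence -> gradient skipped
                   [1.0, 0.9, 0.8]])  # low confidence  -> gradient computed
keep = select_backward_samples(logits)
```

Raising the threshold keeps more samples in the backward pass, which is how the paper trades skip rate against the squared-error constraint.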
Citations: 6
Taming Unstructured Sparsity on GPUs via Latency-Aware Optimization
Pub Date : 2020-07-01 DOI: 10.1109/DAC18072.2020.9218644
Maohua Zhu, Yuan Xie
Neural Networks (NNs) exhibit high redundancy in their parameters, so pruning methods can achieve high compression ratios without accuracy loss. However, the very high sparsity produced by unstructured pruning methods is difficult to map efficiently onto Graphics Processing Units (GPUs) because of its decoding overhead and workload imbalance. With the introduction of Tensor Core, the latest GPUs achieve even higher throughput for dense neural networks. This makes unstructured sparse neural networks fail to outperform their dense counterparts, because they are not currently supported by Tensor Core. To tackle this problem, prior work suggests structured pruning to improve the performance of sparse NNs on GPUs. However, such structured pruning methods have to sacrifice a significant part of the sparsity to retain model accuracy, which limits the speedup on the hardware. In this paper, we observe that the Tensor Core is also able to compute unstructured sparse NNs efficiently. To achieve this goal, we first propose ExTensor, a set of sparse Tensor Core instructions with a variable input matrix tile size. The variable tile size allows a matrix multiplication to be implemented by mixing different types of ExTensor instructions. We build a performance model to estimate the latency of an ExTensor instruction given an operand sparse weight matrix. Based on this model, we propose a heuristic algorithm that finds the optimal instruction sequence for an ExTensor-based kernel to achieve the best performance on the GPU. Experimental results demonstrate that our approach achieves 36% better performance than the state-of-the-art sparse Tensor Core design.
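The workload imbalance that motivates this work is easy to see numerically. A toy sketch of ours (the pruning threshold and matrix shape are arbitrary): after unstructured magnitude pruning, nonzeros per row vary widely, so a scheme that assigns one row per thread leaves most threads waiting for the longest row.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 64))
W[np.abs(W) < 1.0] = 0.0                  # unstructured magnitude pruning

row_nnz = (W != 0).sum(axis=1)            # work per row is uneven
imbalance = row_nnz.max() / row_nnz.mean()  # >1 means idle threads
```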
Citations: 3
WET: Write Efficient Loop Tiling for Non-Volatile Main Memory
Pub Date : 2020-07-01 DOI: 10.1109/DAC18072.2020.9218612
Mohammad A. Alshboul, James Tuck, Yan Solihin
Future systems are expected to increasingly include Non-Volatile Main Memory (NVMM). However, due to the limited write endurance of NVMM, the number of writes must be reduced. While new architectures and algorithms have been proposed to reduce writes to NVMM, few studies have looked at the effect of compiler optimizations on writes. In this paper, we investigate the impact of one popular compiler optimization (loop tiling) on a very important computation kernel (matrix multiplication). Our novel observation is that tiling matrix multiplication causes a 25× write amplification. Furthermore, we investigate techniques to make tiling more NVMM-friendly, by choosing the right tile size and employing hierarchical tiling. Our method, Write-Efficient Tiling (WET), adds a new outer tile designed to fit the write working set in the Last Level Cache (LLC), reducing the number of writes to NVMM. Our experiments reduce writes by 81% while simultaneously improving performance.
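The hierarchical tiling WET builds on can be sketched as below. This is our own minimal illustration, not the authors' code: the outer `ii/jj` tiles bound the block of C being written (the write working set, which WET sizes to the LLC), while the inner `kk` tile covers the reads of A and B; each element of C is accumulated in cache and written back once.

```python
import numpy as np

def tiled_matmul(A, B, tile=2):
    """C = A @ B with square tiling; tile size is illustrative only."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for ii in range(0, n, tile):          # outer tiles: write working set of C
        for jj in range(0, n, tile):
            for kk in range(0, n, tile):  # inner tiles: reads of A and B
                C[ii:ii+tile, jj:jj+tile] += (
                    A[ii:ii+tile, kk:kk+tile] @ B[kk:kk+tile, jj:jj+tile])
    return C

A = np.arange(16.0).reshape(4, 4)
B = np.eye(4)
C = tiled_matmul(A, B)   # multiplying by the identity returns A
```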
Citations: 4
Scalable Multi-FPGA Acceleration for Large RNNs with Full Parallelism Levels
Pub Date : 2020-07-01 DOI: 10.1109/DAC18072.2020.9218528
Dongup Kwon, Suyeon Hur, Hamin Jang, E. Nurvitadhi, Jangwoo Kim
The increasing size of recurrent neural networks (RNNs) makes it hard to meet the growing demand for real-time AI services. For low-latency RNN serving, FPGA-based accelerators can leverage specialized architectures with optimized dataflow. However, they also suffer from severe hardware under-utilization when partitioning RNNs, and thus fail to obtain scalable performance. In this paper, we identify the performance bottlenecks of existing RNN partitioning strategies. We then propose a novel RNN partitioning strategy to achieve scalable multi-FPGA acceleration for large RNNs. First, we introduce three parallelism levels and exploit them by partitioning weight matrices, matrix/vector operations, and layers. Second, we examine the performance impact of collective communications and software pipelining to derive more accurate and optimal distribution results. We prototyped an FPGA-based acceleration system using multiple Intel high-end FPGAs, and our partitioning scheme allows up to 2.4× faster inference of modern RNN workloads than conventional partitioning methods.
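The first parallelism level named above, weight-matrix partitioning, has a simple functional form. A sketch of ours (not the authors' scheme; device count and shapes are arbitrary): split the matrix row-wise so each device computes one slice of the matrix-vector product, and concatenate the slices afterwards.

```python
import numpy as np

def distributed_matvec(W, x, n_devices=4):
    """Row-partitioned y = W @ x; each part stands in for one device."""
    parts = np.array_split(W, n_devices, axis=0)
    return np.concatenate([p @ x for p in parts])  # one slice per device

rng = np.random.default_rng(1)
W = rng.standard_normal((8, 8))
x = rng.standard_normal(8)
y = distributed_matvec(W, x)   # matches the single-device product
```

In a real multi-FPGA system the concatenation is the collective communication step whose cost the paper models.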
Citations: 7
CAP’NN: Class-Aware Personalized Neural Network Inference
Pub Date : 2020-07-01 DOI: 10.1109/DAC18072.2020.9218741
Maedeh Hemmat, Joshua San Miguel, A. Davoodi
We propose CAP’NN, a framework for Class-Aware Personalized Neural Network Inference. CAP’NN prunes an already-trained neural network model based on the preferences of individual users. Specifically, by adapting to the subset of output classes that each user is expected to encounter, CAP’NN is able to prune not only ineffectual neurons but also miseffectual neurons that confuse classification, without the need to retrain the network. CAP’NN achieves up to 50% model size reduction while actually improving top-1 (top-5) classification accuracy by up to 2.3% (3.2%) when the user only encounters a subset of VGG-16 classes.
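The simplest instance of class-aware pruning is at the output layer. A sketch of ours (CAP’NN's actual criterion reaches deeper into the network; this only shows the idea): when a user encounters only a subset of classes, the corresponding rows of the final weight matrix suffice.

```python
import numpy as np

def prune_output_layer(W, b, user_classes):
    """Keep only the output units for the classes this user encounters."""
    idx = np.asarray(sorted(user_classes))
    return W[idx], b[idx]

rng = np.random.default_rng(2)
W = rng.standard_normal((10, 64))   # 10-class output layer
b = rng.standard_normal(10)
Wp, bp = prune_output_layer(W, b, {2, 5, 7})   # user sees 3 of 10 classes
```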
Citations: 5
Romeo: Conversion and Evaluation of HDL Designs in the Encrypted Domain
Pub Date : 2020-07-01 DOI: 10.1109/DAC18072.2020.9218579
Charles Gouert, N. G. Tsoutsos
As cloud computing becomes increasingly ubiquitous, protecting the confidentiality of data outsourced to third parties becomes a priority. While encryption is a natural solution to this problem, traditional algorithms may only protect data at rest and in transit, but do not support encrypted processing. In this work we introduce ROMEO, which enables easy-to-use privacy-preserving processing of data in the cloud using homomorphic encryption. ROMEO automatically converts arbitrary programs expressed in Verilog HDL into equivalent homomorphic circuits that are evaluated using encrypted inputs. For our experiments, we employ cryptographic circuits, such as AES, and benchmarks from the ISCAS’85 and ISCAS’89 suites.
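Evaluating a gate directly on ciphertexts can be illustrated with a toy far simpler than the homomorphic encryption ROMEO uses (this is our own illustration, not the paper's scheme): under a one-time pad, XOR-ing two ciphertexts yields a ciphertext of the XOR of the plaintexts, so an XOR gate can be computed without ever decrypting.

```python
import secrets

def encrypt(bit, key):
    return bit ^ key

def decrypt(ct, key):
    return ct ^ key

k1, k2 = secrets.randbelow(2), secrets.randbelow(2)  # one-time-pad keys
p1, p2 = 1, 0                                        # plaintext wire values
c_out = encrypt(p1, k1) ^ encrypt(p2, k2)   # XOR gate in the encrypted domain
result = decrypt(c_out, k1 ^ k2)            # equals p1 ^ p2
```

Schemes like TFHE generalize this to arbitrary Boolean gates, which is what makes compiling a Verilog netlist into a homomorphic circuit possible.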
Citations: 8
ATUNs: Modular and Scalable Support for Atomic Operations in a Shared Memory Multiprocessor
Pub Date : 2020-07-01 DOI: 10.1109/DAC18072.2020.9218661
Andreas Kurth, Samuel Riedel, Florian Zaruba, T. Hoefler, L. Benini
Atomic operations are crucial for most modern parallel and concurrent algorithms, which necessitates their optimized implementation in highly-scalable manycore processors. We propose a modular and efficient, open-source ATomic UNit (ATUN) architecture that can be placed flexibly at different levels of the memory hierarchy. ATUN demonstrates near-optimal linear scaling for various synthetic and real-world workloads on an FPGA prototype with 32 RISC-V cores. We characterize the hardware complexity of our ATUN design in 22 nm FDSOI and find that it scales linearly in area (only 0.5 kGE per core) and logarithmically in the critical path.
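A software analogue of the operation such a unit performs (ours, for illustration only; ATUN implements this in hardware near memory) is a fetch-and-add whose read-modify-write is indivisible, so concurrent increments never lose updates.

```python
import threading

class AtomicCounter:
    """Lock-protected fetch-and-add, emulating a hardware atomic unit."""
    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def fetch_add(self, delta):
        with self._lock:            # read-modify-write happens indivisibly
            old = self._value
            self._value += delta
            return old

counter = AtomicCounter()
threads = [threading.Thread(
               target=lambda: [counter.fetch_add(1) for _ in range(1000)])
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# 4 threads x 1000 increments: no update is lost
```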
Citations: 3
CL(R)Early: An Early-stage DSE Methodology for Cross-Layer Reliability-aware Heterogeneous Embedded Systems
Pub Date : 2020-07-01 DOI: 10.1109/DAC18072.2020.9218747
Siva Satyendra Sahoo, B. Veeravalli, Akash Kumar
Cross-layer reliability (CLR) presents a cost-effective alternative to traditional single-layer design in resource-constrained embedded systems. CLR provides scope for leveraging the inherent fault-masking of multiple layers and exploiting application-specific tolerance to degradation in some Quality of Service (QoS) metrics. However, it can also lead to an explosion in design complexity. State-of-the-art approaches to such joint optimization across multiple degrees of freedom can degrade system-level Design Space Exploration (DSE) results. To this end, we propose a DSE methodology for enabling CLR-aware task-mapping in heterogeneous embedded systems. Specifically, we present novel approaches to both task- and system-level analysis for performing an early-stage exploration of various design decisions. The proposed methodology yields considerable improvements over other state-of-the-art approaches and shows significant scaling with application size.
Citations: 5
A Cross-Layer Power and Timing Evaluation Method for Wide Voltage Scaling
Pub Date : 2020-07-01 DOI: 10.1109/DAC18072.2020.9218682
Wenjie Fu, Leilei Jin, Ming Ling, Yu Zheng, Longxing Shi
Wide supply voltage scaling is critical to enable worthwhile dynamic adjustment of processor efficiency against varying workloads. In this paper, a cross-layer power and timing evaluation method is proposed to estimate processor energy efficiency using both circuit and architectural information over a wide voltage range. Process variations are considered through statistical static timing analysis, while the voltage effect is modeled through secondary iterated fittings. The error in estimating processor energy efficiency decreases to 8.29% when the supply voltage is scaled from 1.1V to 0.6V, while traditional architectural evaluations exhibit errors of more than 40%.
Citations: 3
ALSRAC: Approximate Logic Synthesis by Resubstitution with Approximate Care Set
Pub Date : 2020-07-01 DOI: 10.1109/DAC18072.2020.9218627
Chang Meng, Weikang Qian, A. Mishchenko
Approximate computing is an emerging design technique for error-resilient applications. It improves circuit area, power, and delay at the cost of introducing some errors. Approximate logic synthesis (ALS) is an automatic process for producing approximate circuits. This paper proposes resubstitution with an approximate care set and uses it to build a simulation-based ALS flow. Experimental results demonstrate that the proposed method saves 7%–18% area compared to state-of-the-art methods. The ALSRAC code is open-source.
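The simulation step in such a flow is worth making concrete. A toy sketch of ours (the circuits and trial count are arbitrary, not from the paper): compare an exact circuit against an approximate one on random input patterns to estimate the error rate the approximation introduces.

```python
import random

def exact(a, b, c):
    # Exact circuit: a 2-input AND feeding an OR.
    return (a and b) or c

def approx(a, b, c):
    # Approximation: drop the AND gate's b input entirely.
    return a or c

def error_rate(trials=10000, seed=0):
    """Monte-Carlo estimate of how often the two circuits disagree."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        a, b, c = (rng.randint(0, 1) for _ in range(3))
        if exact(a, b, c) != approx(a, b, c):
            errors += 1
    return errors / trials

rate = error_rate()   # only the pattern a=1, b=0, c=0 disagrees: ~1/8
```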
Citations: 15