
Latest Publications: 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)

The Anytime Automaton
Joshua San Miguel, Natalie D. Enright Jerger
Approximate computing is an emerging paradigm enabling tradeoffs between accuracy and efficiency. However, a fundamental challenge persists: state-of-the-art techniques lack the ability to enforce runtime guarantees on accuracy. The convention is to 1) employ offline or online accuracy models, or 2) present experimental results that demonstrate empirically low error. Unfortunately, these approaches are still unable to guarantee acceptability of all application outputs at runtime. We offer a solution that revisits concepts from anytime algorithms. Originally explored for real-time decision problems, anytime algorithms have the property of producing results with increasing accuracy over time. We propose the Anytime Automaton, a new computation model that executes applications as a parallel pipeline of anytime approximations. An automaton produces approximate versions of the application output with increasing accuracy, guaranteeing that the final precise version is eventually reached. The automaton can be stopped whenever the output is deemed acceptable; otherwise, it is simply a matter of letting it run longer. We present an in-depth analysis of the model and demonstrate attractive runtime-accuracy profiles on various applications. Our anytime automaton is the first step towards systems where the acceptability of an application's output directly governs the amount of time and energy expended.
{"title":"The Anytime Automaton","authors":"Joshua San Miguel, Natalie D. Enright Jerger","doi":"10.1145/3007787.3001195","DOIUrl":"https://doi.org/10.1145/3007787.3001195","url":null,"abstract":"Approximate computing is an emerging paradigm enabling tradeoffs between accuracy and efficiency. However, a fundamental challenge persists: state-of-the-art techniques lack the ability to enforce runtime guarantees on accuracy. The convention is to 1) employ offline or online accuracy models, or 2) present experimental results that demonstrate empirically low error. Unfortunately, these approaches are still unable to guarantee acceptability of all application outputs at runtime. We offer a solution that revisits concepts from anytime algorithms. Originally explored for real-time decision problems, anytime algorithms have the property of producing results with increasing accuracy over time. We propose the Anytime Automaton, a new computation model that executes applications as a parallel pipeline of anytime approximations. An automaton produces approximate versions of the application output with increasing accuracy, guaranteeing that the final precise version is eventually reached. The automaton can be stopped whenever the output is deemed acceptable, otherwise, it is a simple matter of letting it run longer. We present an in-depth analysis of the model and demonstrate attractive runtime-accuracy profiles on various applications. Our anytime automaton is the first step towards systems where the acceptability of an application's output directly governs the amount of time and energy expended.","PeriodicalId":6634,"journal":{"name":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","volume":"782 1","pages":"545-557"},"PeriodicalIF":0.0,"publicationDate":"2016-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89229621","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 26
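To make the anytime property concrete, here is a minimal Python sketch of a computation that yields successively more accurate versions of its output and can be stopped as soon as the result is deemed acceptable. The function name, chunking scheme, and stopping rule are illustrative assumptions; this shows the general anytime property, not the paper's Anytime Automaton pipeline.

```python
# A computation with the anytime property: each yield is a usable output,
# and later yields are more accurate than earlier ones.
import random

def anytime_mean(samples, chunk=100):
    """Yield running estimates of the mean, refining with each chunk."""
    total, count = 0.0, 0
    for i in range(0, len(samples), chunk):
        for x in samples[i:i + chunk]:
            total += x
            count += 1
        yield total / count  # an increasingly accurate approximate output

if __name__ == "__main__":
    data = [random.gauss(5.0, 1.0) for _ in range(10_000)]
    previous = None
    for estimate in anytime_mean(data):
        # Stop early once consecutive estimates agree to the desired tolerance;
        # otherwise simply let the computation run longer.
        if previous is not None and abs(estimate - previous) < 1e-3:
            break
        previous = estimate
    print(f"accepted estimate: {estimate:.4f}")
```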
Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks
Yu-hsin Chen, J. Emer, V. Sze
Deep convolutional neural networks (CNNs) are widely used in modern AI systems for their superior accuracy but at the cost of high computational complexity. The complexity comes from the need to simultaneously process hundreds of filters and channels in the high-dimensional convolutions, which involve a significant amount of data movement. Although highly-parallel compute paradigms, such as SIMD/SIMT, effectively address the computation requirement to achieve high throughput, energy consumption still remains high as data movement can be more expensive than computation. Accordingly, finding a dataflow that supports parallel processing with minimal data movement cost is crucial to achieving energy-efficient CNN processing without compromising accuracy. In this paper, we present a novel dataflow, called row-stationary (RS), that minimizes data movement energy consumption on a spatial architecture. This is realized by exploiting local data reuse of filter weights and feature map pixels, i.e., activations, in the high-dimensional convolutions, and minimizing data movement of partial sum accumulations. Unlike dataflows used in existing designs, which only reduce certain types of data movement, the proposed RS dataflow can adapt to different CNN shape configurations and reduces all types of data movement through maximally utilizing the processing engine (PE) local storage, direct inter-PE communication and spatial parallelism. To evaluate the energy efficiency of the different dataflows, we propose an analysis framework that compares energy cost under the same hardware area and processing parallelism constraints. Experiments using the CNN configurations of AlexNet show that the proposed RS dataflow is more energy efficient than existing dataflows in both convolutional (1.4× to 2.5×) and fully-connected layers (at least 1.3× for batch size larger than 16). The RS dataflow has also been demonstrated on a fabricated chip, which verifies our energy analysis.
{"title":"Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks","authors":"Yu-hsin Chen, J. Emer, V. Sze","doi":"10.1145/3007787.3001177","DOIUrl":"https://doi.org/10.1145/3007787.3001177","url":null,"abstract":"Deep convolutional neural networks (CNNs) are widely used in modern AI systems for their superior accuracy but at the cost of high computational complexity. The complexity comes from the need to simultaneously process hundreds of filters and channels in the high-dimensional convolutions, which involve a significant amount of data movement. Although highly-parallel compute paradigms, such as SIMD/SIMT, effectively address the computation requirement to achieve high throughput, energy consumption still remains high as data movement can be more expensive than computation. Accordingly, finding a dataflow that supports parallel processing with minimal data movement cost is crucial to achieving energy-efficient CNN processing without compromising accuracy. In this paper, we present a novel dataflow, called row-stationary (RS), that minimizes data movement energy consumption on a spatial architecture. This is realized by exploiting local data reuse of filter weights and feature map pixels, i.e., activations, in the high-dimensional convolutions, and minimizing data movement of partial sum accumulations. Unlike dataflows used in existing designs, which only reduce certain types of data movement, the proposed RS dataflow can adapt to different CNN shape configurations and reduces all types of data movement through maximally utilizing the processing engine (PE) local storage, direct inter-PE communication and spatial parallelism. To evaluate the energy efficiency of the different dataflows, we propose an analysis framework that compares energy cost under the same hardware area and processing parallelism constraints. Experiments using the CNN configurations of AlexNet show that the proposed RS dataflow is more energy efficient than existing dataflows in both convolutional (1.4× to 2.5×) and fully-connected layers (at least 1.3× for batch size larger than 16). The RS dataflow has also been demonstrated on a fabricated chip, which verifies our energy analysis.","PeriodicalId":6634,"journal":{"name":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","volume":"28 1","pages":"367-379"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73822920","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1281
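To illustrate the row-stationary idea, the Python sketch below decomposes a 2-D convolution into 1-D row convolutions: each (filter row, input row) pair is handled by one logical PE that keeps its filter row stationary, and the resulting partial sums are accumulated across PEs into one output row. The mapping and names are illustrative assumptions; Eyeriss's tiling, buffering, and hardware details are omitted.

```python
# Row-stationary decomposition of a 2-D convolution (cross-correlation form).
import numpy as np

def conv1d_row(input_row, filter_row):
    """One PE's job: slide a stationary filter row over one input row."""
    out_len = len(input_row) - len(filter_row) + 1
    return np.array([np.dot(input_row[j:j + len(filter_row)], filter_row)
                     for j in range(out_len)])

def conv2d_row_stationary(ifmap, weights):
    R, S = weights.shape            # filter rows, filter cols
    H, W = ifmap.shape              # input rows, cols
    out = np.zeros((H - R + 1, W - S + 1))
    for o in range(out.shape[0]):   # each output row
        for r in range(R):          # partial sums accumulated across R PEs
            out[o] += conv1d_row(ifmap[o + r], weights[r])
    return out

if __name__ == "__main__":
    x = np.arange(36, dtype=float).reshape(6, 6)
    w = np.ones((3, 3))
    # Check against a direct sliding-window computation.
    ref = np.array([[np.sum(x[i:i + 3, j:j + 3] * w) for j in range(4)]
                    for i in range(4)])
    assert np.allclose(conv2d_row_stationary(x, w), ref)
    print(conv2d_row_stationary(x, w))
```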
Opportunistic Competition Overhead Reduction for Expediting Critical Section in NoC Based CMPs
Y. Yao, Zhonghai Lu
With the degree of parallelism increasing, the performance of multi-threaded shared-variable applications is limited not only by serialized critical section execution, but also by the serialized competition overhead for threads to gain access to the critical section. As the number of concurrent threads grows, such competition overhead may exceed the time spent in the critical section itself, and become the dominating factor limiting the performance of parallel applications. In modern operating systems, the queue spinlock, which comprises a low-overhead spinning phase and a high-overhead sleeping phase, is often used to lock critical sections. In this paper, we show that this advanced locking solution may create very high competition overhead for multithreaded applications executing in NoC-based CMPs. We then propose a software-hardware cooperative mechanism that can opportunistically maximize the chance that a thread wins the critical section access in the low-overhead spinning phase, thereby reducing the competition overhead. At the OS primitives level, we monitor the remaining times of retry (RTR) in a thread's spinning phase, which reflects how soon the thread must enter the high-overhead sleep mode. At the hardware level, we integrate the RTR information into the packets of locking requests, and let the NoC prioritize locking request packets according to the RTR information. The principle is that the smaller the RTR a locking request packet carries, the higher the priority it gets and thus the quicker its delivery. We evaluate our opportunistic competition overhead reduction technique with cycle-accurate full-system simulations in GEM5 using the PARSEC (11 programs) and SPEC OMP2012 (14 programs) benchmarks. Compared to the original queue spinlock implementation, experimental results show that our method can effectively increase the opportunity of threads entering the critical section in the low-overhead spinning phase, reducing the competition overhead by 39.9% on average (up to 61.8%) and accelerating the execution of the Region-of-Interest by 14.4% on average (up to 24.5%) across all 25 benchmark programs.
{"title":"Opportunistic Competition Overhead Reduction for Expediting Critical Section in NoC Based CMPs","authors":"Y. Yao, Zhonghai Lu","doi":"10.1145/3007787.3001167","DOIUrl":"https://doi.org/10.1145/3007787.3001167","url":null,"abstract":"With the degree of parallelism increasing, performance of multi-threaded shared variable applications is not only limited by serialized critical section execution, but also by the serialized competition overhead for threads to get access to critical section. As the number of concurrent threads grows, such competition overhead may exceed the time spent in critical section itself, and become the dominating factor limiting the performance of parallel applications. In modern operating systems, queue spinlock, which comprises a low-overhead spinning phase and a high-overhead sleeping phase, is often used to lock critical sections. In the paper, we show that this advanced locking solution may create very high competition overhead for multithreaded applications executing in NoC-based CMPs. Then we propose a software-hardware cooperative mechanism that can opportunistically maximize the chance that a thread wins the critical section access in the low-overhead spinning phase, thereby reducing the competition overhead. At the OS primitives level, we monitor the remaining times of retry (RTR) in a thread's spinning phase, which reflects in how long the thread must enter into the high-overhead sleep mode. At the hardware level, we integrate the RTR information into the packets of locking requests, and let the NoC prioritize locking request packets according to the RTR information. The principle is that the smaller RTR a locking request packet carries, the higher priority it gets and thus quicker delivery. We evaluate our opportunistic competition overhead reduction technique with cycle-accurate full-system simulations in GEM5 using PARSEC (11 programs) and SPEC OMP2012 (14 programs) benchmarks. Compared to the original queue spinlock implementation, experimental results show that our method can effectively increase the opportunity of threads entering the critical section in low-overhead spinning phase, reducing the competition overhead averagely by 39.9% (maximally by 61.8%) and accelerating the execution of the Region-of-Interest averagely by 14.4% (maximally by 24.5%) across all 25 benchmark programs.","PeriodicalId":6634,"journal":{"name":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","volume":"83 1","pages":"279-290"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80337153","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 11
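The prioritization principle can be shown with a small Python sketch: lock-request packets carry the requester's remaining times of retry (RTR), and an RTR-aware arbiter grants the lock to the most urgent (smallest-RTR) requester first, so it wins the critical section while still in the cheap spinning phase. The data structures and arbiter below are illustrative assumptions, not the paper's OS primitives or NoC hardware.

```python
# RTR-based prioritization of lock requests: smaller RTR means the thread is
# closer to falling into the expensive sleep phase, so it is served first.
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class LockRequest:
    rtr: int                              # remaining spin retries; smaller = more urgent
    thread_id: int = field(compare=False)

def grant_order(requests):
    """Return the order in which an RTR-aware arbiter would grant the lock."""
    heap = list(requests)
    heapq.heapify(heap)                   # min-heap on RTR: most urgent first
    return [heapq.heappop(heap).thread_id for _ in range(len(heap))]

if __name__ == "__main__":
    pending = [LockRequest(rtr=7, thread_id=0),
               LockRequest(rtr=1, thread_id=1),   # about to enter the sleep phase
               LockRequest(rtr=4, thread_id=2)]
    print(grant_order(pending))           # [1, 2, 0]
```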
ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars
Ali Shafiee, Anirban Nag, Naveen Muralimanohar, R. Balasubramonian, J. Strachan, Miao Hu, R. S. Williams, Vivek Srikumar
A number of recent efforts have attempted to design accelerators for popular machine learning algorithms, such as those involving convolutional and deep neural networks (CNNs and DNNs). These algorithms typically involve a large number of multiply-accumulate (dot-product) operations. A recent project, DaDianNao, adopts a near data processing approach, where a specialized neural functional unit performs all the digital arithmetic operations and receives input weights from adjacent eDRAM banks. This work explores an in-situ processing approach, where memristor crossbar arrays not only store input weights, but are also used to perform dot-product operations in an analog manner. While the use of crossbar memory as an analog dot-product engine is well known, no prior work has designed or characterized a full-fledged accelerator based on crossbars. In particular, our work makes the following contributions: (i) We design a pipelined architecture, with some crossbars dedicated for each neural network layer, and eDRAM buffers that aggregate data between pipeline stages. (ii) We define new data encoding techniques that are amenable to analog computations and that can reduce the high overheads of analog-to-digital conversion (ADC). (iii) We define the many supporting digital components required in an analog CNN accelerator and carry out a design space exploration to identify the best balance of memristor storage/compute, ADCs, and eDRAM storage on a chip. On a suite of CNN and DNN workloads, the proposed ISAAC architecture yields improvements of 14.8×, 5.5×, and 7.5× in throughput, energy, and computational density (respectively), relative to the state-of-the-art DaDianNao architecture.
{"title":"ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars","authors":"Ali Shafiee, Anirban Nag, Naveen Muralimanohar, R. Balasubramonian, J. Strachan, Miao Hu, R. S. Williams, Vivek Srikumar","doi":"10.1145/3007787.3001139","DOIUrl":"https://doi.org/10.1145/3007787.3001139","url":null,"abstract":"A number of recent efforts have attempted to design accelerators for popular machine learning algorithms, such as those involving convolutional and deep neural networks (CNNs and DNNs). These algorithms typically involve a large number of multiply-accumulate (dot-product) operations. A recent project, DaDianNao, adopts a near data processing approach, where a specialized neural functional unit performs all the digital arithmetic operations and receives input weights from adjacent eDRAM banks. This work explores an in-situ processing approach, where memristor crossbar arrays not only store input weights, but are also used to perform dot-product operations in an analog manner. While the use of crossbar memory as an analog dot-product engine is well known, no prior work has designed or characterized a full-fledged accelerator based on crossbars. In particular, our work makes the following contributions: (i) We design a pipelined architecture, with some crossbars dedicated for each neural network layer, and eDRAM buffers that aggregate data between pipeline stages. (ii) We define new data encoding techniques that are amenable to analog computations and that can reduce the high overheads of analog-to-digital conversion (ADC). (iii) We define the many supporting digital components required in an analog CNN accelerator and carry out a design space exploration to identify the best balance of memristor storage/compute, ADCs, and eDRAM storage on a chip. On a suite of CNN and DNN workloads, the proposed ISAAC architecture yields improvements of 14.8×, 5.5×, and 7.5× in throughput, energy, and computational density (respectively), relative to the state-of-the-art DaDianNao architecture.","PeriodicalId":6634,"journal":{"name":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","volume":"89 1","pages":"14-26"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79400343","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1416
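The analog dot-product principle the abstract relies on can be sketched numerically: with weights programmed as crossbar conductances G and inputs applied as row voltages V, each bitline collects a current I_j = sum_i V_i * G_ij by Ohm's law and Kirchhoff's current law. The Python sketch below models only this ideal arithmetic; ISAAC's pipelining, data encoding, and ADC design are not represented, and the values are illustrative.

```python
# Ideal memristor crossbar: column (bitline) currents are analog dot products.
import numpy as np

def crossbar_dot_product(voltages, conductances):
    """voltages: shape (rows,), conductances: shape (rows, cols) -> currents (cols,)."""
    return voltages @ conductances        # I_j = sum_i V_i * G_ij

if __name__ == "__main__":
    G = np.array([[1.0, 0.5],
                  [0.2, 0.8],
                  [0.4, 0.1]])            # weights stored as conductances
    V = np.array([0.3, 0.7, 0.5])         # input activations encoded as voltages
    print(crossbar_dot_product(V, G))     # bitline currents, later digitized by ADCs
```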
EIE: Efficient Inference Engine on Compressed Deep Neural Network
Song Han, Xingyu Liu, Huizi Mao, Jing Pu, A. Pedram, M. Horowitz, W. Dally
State-of-the-art deep neural networks (DNNs) have hundreds of millions of connections and are both computationally and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources and power budgets. While custom hardware helps the computation, fetching weights from DRAM is two orders of magnitude more expensive than ALU operations, and dominates the required power. Previously proposed 'Deep Compression' makes it possible to fit large DNNs (AlexNet and VGGNet) fully in on-chip SRAM. This compression is achieved by pruning the redundant connections and having multiple connections share the same weight. We propose an energy efficient inference engine (EIE) that performs inference on this compressed network model and accelerates the resulting sparse matrix-vector multiplication with weight sharing. Going from DRAM to SRAM gives EIE a 120x energy saving; exploiting sparsity saves 10x; weight sharing gives 8x; and skipping zero activations from ReLU saves another 3x. Evaluated on nine DNN benchmarks, EIE is 189x and 13x faster when compared to CPU and GPU implementations of the same DNN without compression. EIE has a processing power of 102 GOPS working directly on a compressed network, corresponding to 3 TOPS on an uncompressed network, and processes FC layers of AlexNet at 1.88×10^4 frames/sec with a power dissipation of only 600 mW. It is 24,000x and 3,400x more energy efficient than a CPU and GPU respectively. Compared with DaDianNao, EIE has 2.9x, 19x and 3x better throughput, energy efficiency and area efficiency.
{"title":"EIE: Efficient Inference Engine on Compressed Deep Neural Network","authors":"Song Han, Xingyu Liu, Huizi Mao, Jing Pu, A. Pedram, M. Horowitz, W. Dally","doi":"10.1145/3007787.3001163","DOIUrl":"https://doi.org/10.1145/3007787.3001163","url":null,"abstract":"State-of-the-art deep neural networks (DNNs) have hundreds of millions of connections and are both computationally and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources and power budgets. While custom hardware helps the computation, fetching weights from DRAM is two orders of magnitude more expensive than ALU operations, and dominates the required power. Previously proposed 'Deep Compression' makes it possible to fit large DNNs (AlexNet and VGGNet) fully in on-chip SRAM. This compression is achieved by pruning the redundant connections and having multiple connections share the same weight. We propose an energy efficient inference engine (EIE) that performs inference on this compressed network model and accelerates the resulting sparse matrix-vector multiplication with weight sharing. Going from DRAM to SRAM gives EIE 120x energy saving, Exploiting sparsity saves 10x, Weight sharing gives 8x, Skipping zero activations from ReLU saves another 3x. Evaluated on nine DNN benchmarks, EIE is 189x and 13x faster when compared to CPU and GPU implementations of the same DNN without compression. EIE has a processing power of 102 GOPS working directly on a compressed network, corresponding to 3 TOPS on an uncompressed network, and processes FC layers of AlexNet at 1.88x104 frames/sec with a power dissipation of only 600mW. It is 24,000x and 3,400x more energy efficient than a CPU and GPU respectively. Compared with DaDianNao, EIE has 2.9x, 19x and 3x better throughput, energy efficiency and area efficiency.","PeriodicalId":6634,"journal":{"name":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","volume":"232 1","pages":"243-254"},"PeriodicalIF":0.0,"publicationDate":"2016-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75271239","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2224
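The core computation EIE accelerates can be sketched as a sparse matrix-vector product in which weights are stored as small codebook indices (weight sharing from Deep Compression) and zero input activations are skipped entirely. The storage format and loop structure below are illustrative assumptions, not EIE's actual CSC variant or PE partitioning.

```python
# Sparse matrix-vector product with a shared-weight codebook and
# zero-activation skipping.
import numpy as np

def sparse_shared_matvec(columns, codebook, activations, n_rows):
    """columns[j] lists (row, codebook_index) pairs for the nonzero weights of column j."""
    out = np.zeros(n_rows)
    for j, a in enumerate(activations):
        if a == 0.0:                      # skip zero activations (ReLU outputs)
            continue
        for row, idx in columns[j]:       # only stored (nonzero) weights are visited
            out[row] += a * codebook[idx]
    return out

if __name__ == "__main__":
    codebook = np.array([-0.5, 0.25, 1.0])      # shared weight values
    columns = [[(0, 2), (3, 0)],                # column 0 nonzeros
               [],                              # column 1 is all zeros
               [(1, 1), (2, 1), (3, 2)]]        # column 2 nonzeros
    x = np.array([2.0, 5.0, 0.0])               # activation 2 is zero -> skipped
    print(sparse_shared_matvec(columns, codebook, x, n_rows=4))
```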
Efficient synonym filtering and scalable delayed translation for hybrid virtual caching
Chang Hyun Park, Taekyung Heo, Jaehyuk Huh
Conventional translation look-aside buffers (TLBs) are required to complete address translation with short latencies, as the address translation is on the critical path of all memory accesses, even for L1 cache hits. Such strict TLB latency restrictions limit the TLB capacity, as the latency increase with large TLBs may lower the overall performance even with potential TLB miss reductions. Furthermore, TLBs consume a significant amount of energy as they are accessed for every instruction fetch and data access. To avoid the latency restriction and reduce the energy consumption, virtual caching techniques have been proposed to defer translation to after L1 cache misses. However, an efficient solution for the synonym problem has been a critical issue hindering the wide adoption of virtual caching. Based on the virtual caching concept, this study proposes a hybrid virtual memory architecture extending virtual caching to the entire cache hierarchy, aiming to improve both performance and energy consumption. The hybrid virtual caching uses virtual addresses augmented with address space identifiers (ASID) in the cache hierarchy for common non-synonym addresses. For such non-synonyms, the address translation occurs only after last-level cache (LLC) misses. For uncommon synonym addresses, the addresses are translated to physical addresses with conventional TLBs before L1 cache accesses. To support such hybrid translation, we propose an efficient synonym detection mechanism based on Bloom filters which can identify synonym candidates with few false positives. For large memory applications, delayed translation alone cannot solve the address translation problem, as fixed-granularity delayed TLBs may not scale with the increasing memory requirements. To mitigate the translation scalability problem, this study proposes a delayed many segment translation designed for the hybrid virtual caching. The experimental results show that our approach effectively lowers accesses to the TLBs, leading to significant power savings. In addition, the approach provides performance improvement with scalable delayed translation with variable length segments.
{"title":"Efficient synonym filtering and scalable delayed translation for hybrid virtual caching","authors":"Chang Hyun Park, Taekyung Heo, Jaehyuk Huh","doi":"10.1145/3007787.3001160","DOIUrl":"https://doi.org/10.1145/3007787.3001160","url":null,"abstract":"Conventional translation look-aside buffers(TLBs) are required to complete address translation withshort latencies, as the address translation is on the criticalpath of all memory accesses even for L1 cache hits. Such strictTLB latency restrictions limit the TLB capacity, as the latencyincrease with large TLBs may lower the overall performanceeven with potential TLB miss reductions. Furthermore, TLBsconsume a significant amount of energy as they are accessedfor every instruction fetch and data access. To avoid thelatency restriction and reduce the energy consumption, virtualcaching techniques have been proposed to defer translation toafter L1 cache misses. However, an efficient solution for thesynonym problem has been a critical issue hindering the wideadoption of virtual caching.Based on the virtual caching concept, this study proposes ahybrid virtual memory architecture extending virtual cachingto the entire cache hierarchy, aiming to improve both performanceand energy consumption. The hybrid virtual cachinguses virtual addresses augmented with address space identifiers(ASID) in the cache hierarchy for common non-synonymaddresses. For such non-synonyms, the address translationoccurs only after last-level cache (LLC) misses. For uncommonsynonym addresses, the addresses are translated to physicaladdresses with conventional TLBs before L1 cache accesses. Tosupport such hybrid translation, we propose an efficient synonymdetection mechanism based on Bloom filters which canidentify synonym candidates with few false positives. For largememory applications, delayed translation alone cannot solvethe address translation problem, as fixed-granularity delayedTLBs may not scale with the increasing memory requirements.To mitigate the translation scalability problem, this studyproposes a delayed many segment translation designed for thehybrid virtual caching. The experimental results show that ourapproach effectively lowers accesses to the TLBs, leading tosignificant power savings. In addition, the approach providesperformance improvement with scalable delayed translationwith variable length segments.","PeriodicalId":6634,"journal":{"name":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","volume":"97 1","pages":"217-229"},"PeriodicalIF":0.0,"publicationDate":"2016-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90632704","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 21
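The Bloom-filter-based synonym detection can be sketched as follows: synonym virtual pages are inserted into a Bloom filter; a lookup miss guarantees a non-synonym, whose translation can be delayed until after an LLC miss, while a hit marks a synonym candidate that takes the conventional pre-L1 TLB path (possibly as a false positive). The hash functions and sizes in this Python sketch are illustrative assumptions, not the paper's hardware design.

```python
# Bloom filter used to classify virtual pages as "maybe synonym" or
# "definitely not a synonym"; misses never produce false negatives.
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))

if __name__ == "__main__":
    synonyms = BloomFilter()
    synonyms.add(0x7f3a2)                     # a virtual page known to be a synonym
    for vpn in (0x7f3a2, 0x10005):
        path = ("pre-L1 TLB (synonym candidate)" if synonyms.might_contain(vpn)
                else "delayed translation after LLC miss")
        print(hex(vpn), "->", path)
```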