NoMap: Speeding-Up JavaScript Using Hardware Transactional Memory
Thomas Shull, Jiho Choi, M. Garzarán, J. Torrellas
doi: 10.1109/HPCA.2019.00054
Scripting languages’ inferior performance stems from their compilers lacking sufficient static information. To address this limitation, they use JIT compilers organized into multiple tiers, with higher tiers using profiling information to generate high-performance code. Checks are inserted to detect incorrect assumptions and, when a check fails, execution transfers to a lower tier. The points of potential transfer between tiers are called Stack Map Points (SMPs). They require a consistent state in both tiers and, hence, limit code optimization across SMPs in the higher tier. This paper examines the code generated by a state-of-the-art JavaScript compiler and finds that the code has a high frequency of SMPs. These SMPs rarely cause execution to transfer to lower tiers. However, both the optimization-limiting effect of the SMPs and the overhead of the SMP-guarding checks contribute to scripting languages’ low performance. To tackle this problem, we extend the compiler to generate hardware transactions around SMPs and perform simple within-transaction optimizations enabled by transactions. We target emerging lightweight HTM systems and call our changes NoMap. We evaluate NoMap on the SunSpider and Kraken suites. We find that NoMap lowers the instruction count by an average of 14.2% and 11.5%, and the execution time by an average of 16.7% and 8.9%, for SunSpider and Kraken, respectively.
Keywords: JavaScript; Transactional Memory; Compiler Optimizations; JIT Compilation.
{"title":"NoMap: Speeding-Up JavaScript Using Hardware Transactional Memory","authors":"Thomas Shull, Jiho Choi, M. Garzarán, J. Torrellas","doi":"10.1109/HPCA.2019.00054","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00054","url":null,"abstract":"Scripting languages’ inferior performance stems from compilers lacking enough static information. To address this limitation, they use JIT compilers organized into multiple tiers, with higher tiers using profiling information to generate high-performance code. Checks are inserted to detect incorrect assumptions and, when a check fails, execution transfers to a lower tier. The points of potential transfer between tiers are called Stack Map Points (SMPs). They require a consistent state in both tiers and, hence, limit code optimization across SMPs in the higher tier. This paper examines the code generated by a state-of-theart JavaScript compiler and finds that the code has a high frequency of SMPs. These SMPs rarely cause execution to transfer to lower tiers. However, both the optimization-limiting effect of the SMPs, and the overhead of the SMP-guarding checks contribute to scripting languages’ low performance. To tackle this problem, we extend the compiler to generate hardware transactions around SMPs, and perform simple within-transaction optimizations enabled by transactions. We target emerging lightweight HTM systems and call our changes NoMap. We evaluate NoMap on the SunSpider and Kraken suites. We find that NoMap lowers the instruction count by an average of 14.2% and 11.5%, and the execution time by an average of 16.7% and 8.9%, for SunSpider and Kraken, respectively. Keywords-JavaScript; Transactional Memory; Compiler Optimizations; JIT Compilation.","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128539894","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Killi: Runtime Fault Classification to Deploy Low Voltage Caches without MBIST
Shrikanth Ganapathy, J. Kalamatianos, Bradford M. Beckmann, Steven E. Raasch, Lukasz G. Szafaryn
doi: 10.1109/HPCA.2019.00046
Supply voltage (VDD) scaling is one of the most effective mechanisms to reduce energy consumption in high-performance microprocessors. However, VDD scaling is challenging for SRAM-based on-chip memories such as caches due to persistent failures at low voltage (LV). Previously designed LV-enabling mechanisms require additional Memory Built-In Self-Test (MBIST) steps, employed either offline or online, to identify persistent failures for every LV operating mode. These additional MBIST steps are time consuming, resulting in extended boot time or delayed power-state transitions. Furthermore, most prior techniques combine MBIST-based solutions with customized Error Correction Codes (ECC), which suffer from non-trivial area or performance overheads. In this paper, we highlight the practical challenges of deploying LV techniques and propose a new low-cost error protection scheme, called Killi, which leverages conventional ECC and parity to enable LV operation. Foremost, failing lines are discovered dynamically at runtime using both parity and ECC, negating the need for extra MBIST testing. Killi then provides on-demand error protection by decoupling cheap error detection from expensive error correction: all lines get error detection via parity, while Single Error Correction, Double Error Detection (SECDED) ECC is employed only for the subset of lines with a single LV fault. All lines with more than one fault are disabled. We evaluate this entirely hardware-contained solution on a GPU write-through L2 cache and show that Vmin (the minimum reliable VDD) can be reduced to 62.5% of nominal VDD when operating at 1 GHz, with a maximum performance degradation of only 0.8%. As a result, an 8-CU GPU with Killi can reduce the power consumption of the L2 cache by 59.3% compared to the baseline L2 cache running at nominal VDD. In addition, Killi reduces the error protection area overhead by 50% compared to SECDED ECC.
Keywords: cache, energy-efficiency, GPU, low voltage,
{"title":"Killi: Runtime Fault Classification to Deploy Low Voltage Caches without MBIST","authors":"Shrikanth Ganapathy, J. Kalamatianos, Bradford M. Beckmann, Steven E. Raasch, Lukasz G. Szafaryn","doi":"10.1109/HPCA.2019.00046","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00046","url":null,"abstract":"Supply voltage (VDD) scaling is one of the most effective mechanisms to reduce energy consumption in highperformance microprocessors. However, VDD scaling is challenging for SRAM-based on-chip memories such as caches due to persistent failures at low voltage (LV). Previously designed LV-enabling mechanisms require additional Memory Built-in Self-Test (MBIST) steps, employed either offline or online to identify persistent failures for every LV operating mode. However, these additional MBIST steps are time consuming, resulting in extended boot time or delayed power state transitions. Furthermore, most prior techniques combine MBIST-based solutions with customized Error Correction Codes (ECC), which suffer from non-trivial area or performance overheads. In this paper, we highlight the practical challenges for deploying LV techniques and propose a new low-cost error protection scheme, called Killi, which leverages conventional ECC and parity to enable LV operation. Foremost, the failing lines are discovered dynamically at runtime using both parity and ECC, negating the need for extra MBIST testing. Killi then provides on demand error protection by decoupling cheap error detection from expensive error correction. Killi provides error detection capability to all lines using parity but employs Single Error Correction, Double Error Detection (SECDED) ECC for a subset of the lines with a single LV fault. All lines with more than one fault are disabled. We evaluate this completely hardware enclosed solution on a GPU write-through L2 cache and show that the Vmin (minimum reliable VDD) can be reduced to 62.5% of nominal VDD when operating at 1GHz with only a maximum of 0.8% performance degradation. As a result, an 8CU GPU with Killi can reduce the power consumption of the L2 cache by 59.3% compared to the baseline L2 cache running at nominal VDD. In addition, Killi reduces the error protection area overhead by 50% compared to SECDED ECC. Keywords—cache, energy-efficiency, GPU, low voltage,","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124281899","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Architecting Waferscale Processors - A GPU Case Study
Saptadeep Pal, Daniel Petrisko, Matthew Tomei, Puneet Gupta, S. Iyer, Rakesh Kumar
doi: 10.1109/HPCA.2019.00042
Increasing communication overheads are already threatening computer system scaling. One approach to dramatically reduce communication overheads is waferscale processing. However, waferscale processors [1], [2], [3] have historically been deemed impractical due to yield issues [1], [4] inherent to conventional integration technology. Emerging integration technologies such as the Silicon Interconnect Fabric (Si-IF) [5], [6], [7], where pre-manufactured dies are directly bonded onto a silicon wafer, may enable one to build a waferscale system without the corresponding yield issues. As such, waferscale architectures need to be revisited. In this paper, we study whether it is feasible and useful to build today’s architectures at waferscale. Using a waferscale GPU as a case study, we show that while a 300 mm wafer can house about 100 GPU modules (GPMs), only a much scaled-down GPU architecture with about 40 GPMs can be built once physical concerns are considered. We also study the performance and energy implications of waferscale architectures. We show that waferscale GPUs can provide significant performance and energy efficiency advantages (up to 18.9x speedup and 143x EDP benefit compared against an equivalent MCM-GPU-based implementation on a PCB) without any change in the programming model. We also develop thread scheduling and data placement policies for waferscale GPU architectures. Our policies outperform state-of-the-art scheduling and data placement policies by up to 2.88x (average 1.4x) and 1.62x (average 1.11x) for the 24-GPM and 40-GPM cases, respectively. Finally, we build the first Si-IF prototype with interconnected dies. We observe 100% of the inter-die interconnects to be successfully connected in our prototype. Coupled with the high yield reported previously for bonding of dies on Si-IF, this demonstrates the technological readiness for building a waferscale GPU architecture.
Keywords: Waferscale Processors, GPU, Silicon Interconnect Fabric
{"title":"Architecting Waferscale Processors - A GPU Case Study","authors":"Saptadeep Pal, Daniel Petrisko, Matthew Tomei, Puneet Gupta, S. Iyer, Rakesh Kumar","doi":"10.1109/HPCA.2019.00042","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00042","url":null,"abstract":"Increasing communication overheads are already threatening computer system scaling. One approach to dramatically reduce communication overheads is waferscale processing. However, waferscale processors [1], [2], [3] have been historically deemed impractical due to yield issues [1], [4] inherent to conventional integration technology. Emerging integration technologies such as Silicon-Interconnection Fabric (Si-IF) [5], [6], [7], where pre-manufactured dies are directly bonded on to a silicon wafer, may enable one to build a waferscale system without the corresponding yield issues. As such, waferscalar architectures need to be revisited. In this paper, we study if it is feasible and useful to build today’s architectures at waferscale. Using a waferscale GPU as a case study, we show that while a 300 mm wafer can house about 100 GPU modules (GPM), only a much scaled down GPU architecture with about 40 GPMs can be built when physical concerns are considered. We also study the performance and energy implications of waferscale architectures. We show that waferscale GPUs can provide significant performance and energy efficiency advantages (up to 18.9x speedup and 143x EDP benefit compared against equivalent MCM-GPU based implementation on PCB) without any change in the programming model. We also develop thread scheduling and data placement policies for waferscale GPU architectures. Our policies outperform state-of-art scheduling and data placement policies by up to 2.88x (average 1.4x) and 1.62x (average 1.11x) for 24 GPM and 40 GPM cases respectively. Finally, we build the first Si-IF prototype with interconnected dies. We observe 100% of the inter-die interconnects to be successfully connected in our prototype. Coupled with the high yield reported previously for bonding of dies on Si-IF, this demonstrates the technological readiness for building a waferscale GPU architecture. Keywords—Waferscale Processors, GPU, Silicon Interconnect Fabric","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"14 5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133177617","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
HyPar: Towards Hybrid Parallelism for Deep Learning Accelerator Array
Linghao Song, Jiachen Mao, Youwei Zhuo, Xuehai Qian, Hai Helen Li, Yiran Chen
doi: 10.1109/HPCA.2019.00027
With the rise of artificial intelligence in recent years, Deep Neural Networks (DNNs) have been widely used in many domains. To achieve high performance and energy efficiency, hardware acceleration of DNNs (especially of inference) is intensively studied both in academia and industry. However, we still face two challenges: large DNN models and datasets, which incur frequent off-chip memory accesses; and the training of DNNs, which is not well explored in recent accelerator designs. To truly provide high-throughput and energy-efficient acceleration for the training of deep and large models, we inevitably need multiple accelerators and must exploit the coarse-grain parallelism among them, beyond the fine-grain parallelism inside a layer that most existing architectures consider. This poses the key research question of finding the best organization of computation and dataflow among the accelerators. In this paper, inspired by recent work in machine learning systems, we propose HyPar, a solution that determines layer-wise parallelism for deep neural network training with an array of DNN accelerators. HyPar partitions the feature map tensors (input and output), the kernel tensors, the gradient tensors, and the error tensors across the DNN accelerators; a partition constitutes the choice of parallelism for the weighted layers. The optimization target is to find a partition that minimizes the total communication during the training of a complete DNN. To solve this problem, we propose a communication model that explains the source and amount of communication, and then use a hierarchical layer-wise dynamic programming method to search for the partition for each layer.
{"title":"HyPar: Towards Hybrid Parallelism for Deep Learning Accelerator Array","authors":"Linghao Song, Jiachen Mao, Youwei Zhuo, Xuehai Qian, Hai Helen Li, Yiran Chen","doi":"10.1109/HPCA.2019.00027","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00027","url":null,"abstract":"With the rise of artificial intelligence in recent years, Deep Neural Networks (DNNs) have been widely used in many domains. To achieve high performance and energy efficiency, hardware acceleration (especially inference) of DNNs is intensively studied both in academia and industry. However, we still face two challenges: large DNN models and datasets, which incur frequent off-chip memory accesses; and the training of DNNs, which is not well-explored in recent accelerator designs. To truly provide high throughput and energy efficient acceleration for the training of deep and large models, we inevitably need to use multiple accelerators to explore the coarse-grain parallelism, compared to the fine-grain parallelism inside a layer considered in most of the existing architectures. It poses the key research question to seek the best organization of computation and dataflow among accelerators. In this paper, inspired by recent work in machine learning systems, we propose a solution HyPar to determine layer-wise parallelism for deep neural network training with an array of DNN accelerators. HyPar partitions the feature map tensors (input and output), the kernel tensors, the gradient tensors, and the error tensors for the DNN accelerators. A partition constitutes the choice of parallelism for weighted layers. The optimization target is to search a partition that minimizes the total communication during training a complete DNN. To solve this problem, we propose a communication model to explain the source and amount of communications. Then, we use a hierarchical layer-wise dynamic programming method to search for the partition for each layer.","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124930496","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
E-RNN: Design Optimization for Efficient Recurrent Neural Networks in FPGAs
Zhe Li, Caiwen Ding, Siyue Wang, Wujie Wen, Youwei Zhuo, Chang Liu, Qinru Qiu, Wenyao Xu, X. Lin, Xuehai Qian, Yanzhi Wang
doi: 10.1109/HPCA.2019.00028
Recurrent Neural Networks (RNNs) are becoming increasingly important for time-series-related applications, which require efficient and real-time implementations. The two major types are Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks. Real-time, efficient, and accurate hardware RNN implementations are challenging because of the high sensitivity to imprecision accumulation and the need for special activation function implementations. A key limitation of prior work is the lack of a systematic design optimization framework spanning the RNN model and its hardware implementation, especially when the block size (or compression ratio) should be jointly optimized with the RNN type, layer size, etc. In this paper, we adopt the block-circulant matrix-based framework and present the Efficient RNN (E-RNN) framework for FPGA implementations of the Automatic Speech Recognition (ASR) application. The overall goal is to improve performance and energy efficiency under an accuracy requirement. We use the alternating direction method of multipliers (ADMM) technique for more accurate block-circulant training, and present two design explorations providing guidance on block size and on reducing RNN training trials. Based on these two observations, we decompose E-RNN into two phases: Phase I determines the RNN model to reduce computation and storage subject to the accuracy requirement, and Phase II covers the hardware implementation of the given RNN model, including processing element design/optimization, quantization, activation implementation, etc. Experimental results on actual FPGA deployments show that E-RNN achieves a maximum energy efficiency improvement of 37.4× compared with ESE, and more than 2× compared with C-LSTM, under the same accuracy.
{"title":"E-RNN: Design Optimization for Efficient Recurrent Neural Networks in FPGAs","authors":"Zhe Li, Caiwen Ding, Siyue Wang, Wujie Wen, Youwei Zhuo, Chang Liu, Qinru Qiu, Wenyao Xu, X. Lin, Xuehai Qian, Yanzhi Wang","doi":"10.1109/HPCA.2019.00028","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00028","url":null,"abstract":"Recurrent Neural Networks (RNNs) are becoming increasingly important for time series-related applications which require efficient and real-time implementations. The two major types are Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks. It is a challenging task to have real-time, efficient, and accurate hardware RNN implementations because of the high sensitivity to imprecision accumulation and the requirement of special activation function implementations. \u0000A key limitation of the prior works is the lack of a systematic design optimization framework of RNN model and hardware implementations, especially when the block size (or compression ratio) should be jointly optimized with RNN type, layer size, etc. In this paper, we adopt the block-circulant matrix-based framework, and present the Efficient RNN (E-RNN) framework for FPGA implementations of the Automatic Speech Recognition (ASR) application. The overall goal is to improve performance/energy efficiency under accuracy requirement. We use the alternating direction method of multipliers (ADMM) technique for more accurate block-circulant training, and present two design explorations providing guidance on block size and reducing RNN training trials. Based on the two observations, we decompose E-RNN in two phases: Phase I on determining RNN model to reduce computation and storage subject to accuracy requirement, and Phase II on hardware implementations given RNN model, including processing element design/optimization, quantization, activation implementation, etc. Experimental results on actual FPGA deployments show that E-RNN achieves a maximum energy efficiency improvement of 37.4$times$ compared with ESE, and more than 2$times$ compared with C-LSTM, under the same accuracy.","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"209 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133079629","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
R3-DLA (Reduce, Reuse, Recycle): A More Efficient Approach to Decoupled Look-Ahead Architectures
Sushant Kondguli, Michael C. Huang
doi: 10.1109/HPCA.2019.00064
Modern societies have developed insatiable demands for more computation capabilities. Exploiting implicit parallelism to provide automatic performance improvement remains a central goal in engineering future general-purpose computing systems. One approach is to use a separate thread context to perform continuous look-ahead, improving the data and instruction supply to the main pipeline. Such a decoupled look-ahead (DLA) architecture can be quite effective in accelerating a broad range of applications with a relatively straightforward implementation. It also offers broad design flexibility, as the look-ahead agent need not be concerned with correctness constraints. In this paper, we explore a number of optimizations that make the look-ahead agent more efficient while extracting more utility from it. With these optimizations, a DLA architecture can achieve an average speedup of 1.4x over a state-of-the-art microarchitecture for a broad set of benchmark suites, making it a powerful tool for enhancing single-thread performance.
{"title":"R3-DLA (Reduce, Reuse, Recycle): A More Efficient Approach to Decoupled Look-Ahead Architectures","authors":"Sushant Kondguli, Michael C. Huang","doi":"10.1109/HPCA.2019.00064","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00064","url":null,"abstract":"Modern societies have developed insatiable demands for more computation capabilities. Exploiting implicit parallelism to provide automatic performance improvement remains a central goal in engineering future general-purpose computing systems. One approach is to use a separate thread context to perform continuous look-ahead to improve the data and instruction supply to the main pipeline. Such a decoupled look-ahead (DLA) architecture can be quite effective in accelerating a broad range of applications in a relatively straightforward implementation. It also has broad design flexibility as the look-ahead agent need not be concerned with correctness constraints. In this paper, we explore a number of optimizations that make the look-ahead agent more efficient and yet extract more utility from it. With these optimizations, a DLA architecture can achieve an average speedup of 1.4 over a state-of-the-art microarchitecture for a broad set of benchmark suites, making it a powerful tool to enhance single-thread performance.","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116510289","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Message from the General Chairs
K. Saeed, A. Marasinghe
doi: 10.1109/HPCA.2019.00005
We are pleased to welcome you to the 26th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD 2014). SBAC-PAD will be held for the first time in Europe, at University Pierre et Marie Curie, Paris, France. This year, we have an outstanding program composed of 43 high-quality papers. Also, three highly distinguished researchers, Henri Bal (Vrije Universiteit, The Netherlands), William Blake (D-Wave Systems Inc, Canada), and John Goodacre (ARM, UK), will provide us with exciting keynote talks. In addition, we will have three associated international events: the 5th Workshop on Architecture and Multi-Core Applications (WAMCA), the Special Edition of the MPP workshop on data-programming models and machines, and the Workshop on Parallel and Distributed Computing for Big Data Applications (WPBA). We are honored to share with the MPP workshop the talks of Michael Flynn (Stanford, USA) and Arvind (MIT, USA).
We would like to thank the many people who contributed to making SBAC-PAD 2014 a success. First of all, we would like to thank Alfredo Goldman (University of Sao Paulo, Brazil) and Laxmikant Kale (University of Illinois at Urbana-Champaign, USA), the Program Chairs, as well as the Track Chairs and the Program Committees, for their splendid work in selecting the papers. We also would like to thank and congratulate the authors for their successful efforts. The help of the members of the Steering Committee in solving problems that arose during the conference organization was most appreciated, and crucial help came from our colleagues on the Organizing Committee; thank you all. We also would like to express our gratitude to our sponsors: the Brazilian Computer Society (SBC), the IEEE Computer Society, Inria, CNRS, and the LIP6 lab, as well as our industrial sponsors Bull, Maxeler, and Nvidia.
It has been a pleasure and an honor to cooperate with the above-mentioned people and the many others who have supported our activities to make this event successful. We wish you a great conference and a wonderful stay in Paris.
{"title":"Message from the General Chairs","authors":"K. Saeed, A. Marasinghe","doi":"10.1109/hpca.2019.00005","DOIUrl":"https://doi.org/10.1109/hpca.2019.00005","url":null,"abstract":"We are pleased to welcome you to the 26th International Symposium on Computer Architecture and High Performance Computing SBAC-PAD 2014. SBACPAD will be held for the first time in Europe at University Pierre et Marie Curie, Paris, France. This year, we have an outstanding program composed of 43 high quality papers. Also, three highly distinguished researchers Henri Bal (Vrije Universiteit, The Netherlands), William Blake (D-Wave Systems Inc, Canada), and John Goodacre (ARM, UK) will provide us with exciting keynote talks. In addition, we will have three associated international events: the 5th Workshop on Architecture and Multi-Core Applications (WAMCA), the Special Edition of MPP workshop on Data-programming models and machines, and Workshop on Parallel and Distributed Computing for Big Data Applications (WPBA). We are honored to share with MPP workshop, the talk of Michael Flynn (Stanford, USA) and Arvind (MIT, USA). We would like to thank the many people who contributed to make the SBACPAD 2014 a success. First of all, we would like to thank Alfredo Goldman (University of Sao Paulo, Brazil) and Laxmikant Kale (University of Illinois at Urbana-Champaign, USA) the Program Chairs, the Track Chairs and the Program Committees for their splendid work in selecting the papers. We also would like to thank and congratulate the authors for their successful efforts. The help of the members of the Steering Committee in solving problems that arise during the conference organization was most appreciated. Crucial help came from our Colleagues of the Organizing Committee; thank you all. We also would like to express our gratitude to our sponsors: the Brazilian Computer Society (SBC), the IEEE Computer Society, Inria, CNRS and LIP6 lab. and our industrial sponsors Bull, Maxeler, and Nvidia. It has been a pleasure and honor to cooperate with the above mentioned people and many others who have supported our activities to make this event successful. We wish you a great conference and a wonderful stay in Paris.","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128406806","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
D-RaNGe: Using Commodity DRAM Devices to Generate True Random Numbers with Low Latency and High Throughput
Jeremie S. Kim, Minesh Patel, Hasan Hassan, Lois Orosa, O. Mutlu
doi: 10.1109/HPCA.2019.00011
We propose a new DRAM-based true random number generator (TRNG) that leverages DRAM cells as an entropy source. The key idea is to intentionally violate the DRAM access timing parameters and use the resulting errors as the source of randomness. Our technique specifically decreases the DRAM row activation latency (timing parameter tRCD) below manufacturer-recommended specifications to induce read errors, or activation failures, that exhibit true random behavior. We then aggregate the resulting data from multiple cells to obtain a TRNG capable of providing a high throughput of random numbers at low latency. To demonstrate that our TRNG design is viable using commodity DRAM chips, we rigorously characterize the behavior of activation failures in 282 state-of-the-art LPDDR4 devices from three major DRAM manufacturers. We verify our observations using four additional DDR3 DRAM devices from the same manufacturers. Our results show that many cells in each device produce random data that remains robust over both time and temperature variation. We use our observations to develop D-RaNGe, a methodology for extracting true random numbers from commodity DRAM devices with high throughput and low latency by deliberately violating the read access timing parameters. We evaluate the quality of our TRNG using the commonly used NIST statistical test suite for randomness and find that D-RaNGe 1) successfully passes each test, and 2) generates true random numbers with over two orders of magnitude higher throughput than the previous highest-throughput DRAM-based TRNG.
{"title":"D-RaNGe: Using Commodity DRAM Devices to Generate True Random Numbers with Low Latency and High Throughput","authors":"Jeremie S. Kim, Minesh Patel, Hasan Hassan, Lois Orosa, O. Mutlu","doi":"10.1109/HPCA.2019.00011","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00011","url":null,"abstract":"We propose a new DRAM-based true random number generator (TRNG) that leverages DRAM cells as an entropy source. The key idea is to intentionally violate the DRAM access timing parameters and use the resulting errors as the source of randomness. Our technique specifically decreases the DRAM row activation latency (timing parameter tRCD) below manufacturer-recommended specifications, to induce read errors, or activation failures, that exhibit true random behavior. We then aggregate the resulting data from multiple cells to obtain a TRNG capable of providing a high throughput of random numbers at low latency. \u0000To demonstrate that our TRNG design is viable using commodity DRAM chips, we rigorously characterize the behavior of activation failures in 282 state-of-the-art LPDDR4 devices from three major DRAM manufacturers. We verify our observations using four additional DDR3 DRAM devices from the same manufacturers. Our results show that many cells in each device produce random data that remains robust over both time and temperature variation. We use our observations to develop D-RanGe, a methodology for extracting true random numbers from commodity DRAM devices with high throughput and low latency by deliberately violating the read access timing parameters. We evaluate the quality of our TRNG using the commonly-used NIST statistical test suite for randomness and find that D-RaNGe: 1) successfully passes each test, and 2) generates true random numbers with over two orders of magnitude higher throughput than the previous highest-throughput DRAM-based TRNG.","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131100019","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rendering Elimination: Early Discard of Redundant Tiles in the Graphics Pipeline
Martí Anglada, Enrique de Lucas, Joan-Manuel Parcerisa, Juan L. Aragón, P. Marcuello, Antonio González
doi: 10.1109/HPCA.2019.00014
GPUs are one of the most energy-consuming components for real-time rendering applications, since a large number of fragment shading computations and memory accesses are involved. Main memory bandwidth especially taxes battery-operated devices such as smartphones. Tile-Based Rendering GPUs divide the screen space into multiple tiles that are independently rendered in on-chip buffers, thus reducing memory bandwidth and energy consumption. We have observed that, in many animated graphics workloads, a large number of screen tiles have the same color across adjacent frames. In this paper, we propose Rendering Elimination (RE), a novel microarchitectural technique that accurately determines, before rasterization, whether a tile will be identical to the same tile in the preceding frame, by comparing signatures. Since RE identifies redundant tiles early in the graphics pipeline, it completely avoids the computation and memory accesses of the most power-consuming stages of the pipeline, which substantially reduces the execution time and the energy consumption of the GPU. For widely used Android applications, we show that RE achieves an average speedup of 1.74x and an energy reduction of 43% for the GPU/memory system, surpassing by far the benefits of Transaction Elimination, a state-of-the-art memory bandwidth reduction technique available in some commercial Tile-Based Rendering GPUs.
{"title":"Rendering Elimination: Early Discard of Redundant Tiles in the Graphics Pipeline","authors":"Martí Anglada, Enrique de Lucas, Joan-Manuel Parcerisa, Juan L. Aragón, P. Marcuello, Antonio González","doi":"10.1109/HPCA.2019.00014","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00014","url":null,"abstract":"GPUs are one of the most energy-consuming components for real-time rendering applications, since a large number of fragment shading computations and memory accesses are involved. Main memory bandwidth is especially taxing battery-operated devices such as smartphones. Tile-Based Rendering GPUs divide the screen space into multiple tiles that are independently rendered in on-chip buffers, thus reducing memory bandwidth and energy consumption. We have observed that, in many animated graphics workloads, a large number of screen tiles have the same color across adjacent frames. In this paper, we propose Rendering Elimination (RE), a novel micro-architectural technique that accurately determines if a tile will be identical to the same tile in the preceding frame before rasterization by means of comparing signatures. Since RE identifies redundant tiles early in the graphics pipeline, it completely avoids the computation and memory accesses of the most power consuming stages of the pipeline, which substantially reduces the execution time and the energy consumption of the GPU. For widely used Android applications, we show that RE achieves an average speedup of 1.74x and energy reduction of 43% for the GPU/Memory system, surpassing by far the benefits of Transaction Elimination, a state-of-the-art memory bandwidth reduction technique available in some commercial Tile-Based Rendering GPUs.","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125726971","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pliant: Leveraging Approximation to Improve Datacenter Resource Efficiency
Neeraj Kulkarni, Feng Qi, Christina Delimitrou
doi: 10.1109/HPCA.2019.00035
Cloud multi-tenancy is typically constrained to a single interactive service colocated with one or more batch, low-priority services, whose performance can be sacrificed when deemed necessary. Approximate computing applications offer the opportunity to enable tighter colocation among multiple applications whose performance is important. We present Pliant, a lightweight cloud runtime that leverages the ability of approximate computing applications to tolerate some loss in output quality to boost the utilization of shared servers. During periods of high resource contention, Pliant employs incremental and interference-aware approximation to reduce contention in shared resources and prevent QoS violations for co-scheduled interactive, latency-critical services. We evaluate Pliant across different interactive and approximate computing applications, and show that it preserves QoS for all co-scheduled workloads while incurring a 2.1% loss in output quality, on average.
{"title":"Pliant: Leveraging Approximation to Improve Datacenter Resource Efficiency","authors":"Neeraj Kulkarni, Feng Qi, Christina Delimitrou","doi":"10.1109/HPCA.2019.00035","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00035","url":null,"abstract":"Cloud multi-tenancy is typically constrained to a single interactive service colocated with one or more batch, low-priority services, whose performance can be sacrificed when deemed necessary. Approximate computing applications offer the opportunity to enable tighter colocation among multiple applications whose performance is important. We present Pliant, a lightweight cloud runtime that leverages the ability of approximate computing applications to tolerate some loss in their output quality to boost the utilization of shared servers. During periods of high resource contention, Pliant employs incremental and interference-aware approximation to reduce contention in shared resources, and prevent QoS violations for co-scheduled interactive, latency-critical services. We evaluate Pliant across different interactive and approximate computing applications, and show that it preserves QoS for all co-scheduled workloads, while incurring a 2.1% loss in output quality, on average.","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-04-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130871068","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}