NoMap: Speeding-Up JavaScript Using Hardware Transactional Memory
Thomas Shull, Jiho Choi, M. Garzarán, J. Torrellas
doi: 10.1109/HPCA.2019.00054
Scripting languages’ inferior performance stems from their compilers lacking sufficient static information. To address this limitation, they use JIT compilers organized into multiple tiers, with higher tiers using profiling information to generate high-performance code. Checks are inserted to detect incorrect assumptions and, when a check fails, execution transfers to a lower tier. The points of potential transfer between tiers are called Stack Map Points (SMPs). They require a consistent state in both tiers and, hence, limit code optimization across SMPs in the higher tier. This paper examines the code generated by a state-of-the-art JavaScript compiler and finds that the code has a high frequency of SMPs. These SMPs rarely cause execution to transfer to lower tiers. However, both the optimization-limiting effect of the SMPs and the overhead of the SMP-guarding checks contribute to scripting languages’ low performance. To tackle this problem, we extend the compiler to generate hardware transactions around SMPs and perform simple within-transaction optimizations enabled by transactions. We target emerging lightweight HTM systems and call our changes NoMap. We evaluate NoMap on the SunSpider and Kraken suites. We find that NoMap lowers the instruction count by an average of 14.2% and 11.5%, and the execution time by an average of 16.7% and 8.9%, for SunSpider and Kraken, respectively.
Keywords: JavaScript; Transactional Memory; Compiler Optimizations; JIT Compilation.
{"title":"NoMap: Speeding-Up JavaScript Using Hardware Transactional Memory","authors":"Thomas Shull, Jiho Choi, M. Garzarán, J. Torrellas","doi":"10.1109/HPCA.2019.00054","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00054","url":null,"abstract":"Scripting languages’ inferior performance stems from compilers lacking enough static information. To address this limitation, they use JIT compilers organized into multiple tiers, with higher tiers using profiling information to generate high-performance code. Checks are inserted to detect incorrect assumptions and, when a check fails, execution transfers to a lower tier. The points of potential transfer between tiers are called Stack Map Points (SMPs). They require a consistent state in both tiers and, hence, limit code optimization across SMPs in the higher tier. This paper examines the code generated by a state-of-theart JavaScript compiler and finds that the code has a high frequency of SMPs. These SMPs rarely cause execution to transfer to lower tiers. However, both the optimization-limiting effect of the SMPs, and the overhead of the SMP-guarding checks contribute to scripting languages’ low performance. To tackle this problem, we extend the compiler to generate hardware transactions around SMPs, and perform simple within-transaction optimizations enabled by transactions. We target emerging lightweight HTM systems and call our changes NoMap. We evaluate NoMap on the SunSpider and Kraken suites. We find that NoMap lowers the instruction count by an average of 14.2% and 11.5%, and the execution time by an average of 16.7% and 8.9%, for SunSpider and Kraken, respectively. Keywords-JavaScript; Transactional Memory; Compiler Optimizations; JIT Compilation.","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128539894","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Killi: Runtime Fault Classification to Deploy Low Voltage Caches without MBIST
Shrikanth Ganapathy, J. Kalamatianos, Bradford M. Beckmann, Steven E. Raasch, Lukasz G. Szafaryn
doi: 10.1109/HPCA.2019.00046
Supply voltage (VDD) scaling is one of the most effective mechanisms to reduce energy consumption in high-performance microprocessors. However, VDD scaling is challenging for SRAM-based on-chip memories such as caches due to persistent failures at low voltage (LV). Previously designed LV-enabling mechanisms require additional Memory Built-In Self-Test (MBIST) steps, employed either offline or online, to identify persistent failures for every LV operating mode. These additional MBIST steps are time consuming, resulting in extended boot time or delayed power-state transitions. Furthermore, most prior techniques combine MBIST-based solutions with customized Error Correction Codes (ECC), which suffer from non-trivial area or performance overheads. In this paper, we highlight the practical challenges of deploying LV techniques and propose a new low-cost error protection scheme, called Killi, which leverages conventional ECC and parity to enable LV operation. Foremost, failing lines are discovered dynamically at runtime using both parity and ECC, negating the need for extra MBIST testing. Killi then provides on-demand error protection by decoupling cheap error detection from expensive error correction: all lines get error detection via parity, while Single Error Correction, Double Error Detection (SECDED) ECC is employed only for the subset of lines with a single LV fault. All lines with more than one fault are disabled. We evaluate this entirely hardware-contained solution on a GPU write-through L2 cache and show that Vmin (the minimum reliable VDD) can be reduced to 62.5% of nominal VDD when operating at 1 GHz, with a maximum performance degradation of only 0.8%. As a result, an 8-CU GPU with Killi can reduce the power consumption of the L2 cache by 59.3% compared to the baseline L2 cache running at nominal VDD. In addition, Killi reduces the error protection area overhead by 50% compared to SECDED ECC.
Keywords: cache, energy-efficiency, GPU, low voltage,
{"title":"Killi: Runtime Fault Classification to Deploy Low Voltage Caches without MBIST","authors":"Shrikanth Ganapathy, J. Kalamatianos, Bradford M. Beckmann, Steven E. Raasch, Lukasz G. Szafaryn","doi":"10.1109/HPCA.2019.00046","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00046","url":null,"abstract":"Supply voltage (VDD) scaling is one of the most effective mechanisms to reduce energy consumption in highperformance microprocessors. However, VDD scaling is challenging for SRAM-based on-chip memories such as caches due to persistent failures at low voltage (LV). Previously designed LV-enabling mechanisms require additional Memory Built-in Self-Test (MBIST) steps, employed either offline or online to identify persistent failures for every LV operating mode. However, these additional MBIST steps are time consuming, resulting in extended boot time or delayed power state transitions. Furthermore, most prior techniques combine MBIST-based solutions with customized Error Correction Codes (ECC), which suffer from non-trivial area or performance overheads. In this paper, we highlight the practical challenges for deploying LV techniques and propose a new low-cost error protection scheme, called Killi, which leverages conventional ECC and parity to enable LV operation. Foremost, the failing lines are discovered dynamically at runtime using both parity and ECC, negating the need for extra MBIST testing. Killi then provides on demand error protection by decoupling cheap error detection from expensive error correction. Killi provides error detection capability to all lines using parity but employs Single Error Correction, Double Error Detection (SECDED) ECC for a subset of the lines with a single LV fault. All lines with more than one fault are disabled. We evaluate this completely hardware enclosed solution on a GPU write-through L2 cache and show that the Vmin (minimum reliable VDD) can be reduced to 62.5% of nominal VDD when operating at 1GHz with only a maximum of 0.8% performance degradation. As a result, an 8CU GPU with Killi can reduce the power consumption of the L2 cache by 59.3% compared to the baseline L2 cache running at nominal VDD. In addition, Killi reduces the error protection area overhead by 50% compared to SECDED ECC. Keywords—cache, energy-efficiency, GPU, low voltage,","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124281899","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Architecting Waferscale Processors - A GPU Case Study
Saptadeep Pal, Daniel Petrisko, Matthew Tomei, Puneet Gupta, S. Iyer, Rakesh Kumar
doi: 10.1109/HPCA.2019.00042
Increasing communication overheads are already threatening computer system scaling. One approach to dramatically reduce communication overheads is waferscale processing. However, waferscale processors [1], [2], [3] have historically been deemed impractical due to yield issues [1], [4] inherent to conventional integration technology. Emerging integration technologies such as the Silicon Interconnect Fabric (Si-IF) [5], [6], [7], where pre-manufactured dies are directly bonded onto a silicon wafer, may enable one to build a waferscale system without the corresponding yield issues. As such, waferscale architectures need to be revisited. In this paper, we study whether it is feasible and useful to build today’s architectures at waferscale. Using a waferscale GPU as a case study, we show that while a 300 mm wafer can house about 100 GPU modules (GPMs), only a much scaled-down GPU architecture with about 40 GPMs can be built once physical concerns are considered. We also study the performance and energy implications of waferscale architectures. We show that waferscale GPUs can provide significant performance and energy efficiency advantages (up to 18.9x speedup and 143x EDP benefit compared against an equivalent MCM-GPU-based implementation on a PCB) without any change in the programming model. We also develop thread scheduling and data placement policies for waferscale GPU architectures. Our policies outperform state-of-the-art scheduling and data placement policies by up to 2.88x (average 1.4x) and 1.62x (average 1.11x) for the 24-GPM and 40-GPM cases, respectively. Finally, we build the first Si-IF prototype with interconnected dies. We observe 100% of the inter-die interconnects to be successfully connected in our prototype. Coupled with the high yield reported previously for bonding of dies on Si-IF, this demonstrates the technological readiness for building a waferscale GPU architecture.
Keywords: Waferscale Processors, GPU, Silicon Interconnect Fabric
{"title":"Architecting Waferscale Processors - A GPU Case Study","authors":"Saptadeep Pal, Daniel Petrisko, Matthew Tomei, Puneet Gupta, S. Iyer, Rakesh Kumar","doi":"10.1109/HPCA.2019.00042","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00042","url":null,"abstract":"Increasing communication overheads are already threatening computer system scaling. One approach to dramatically reduce communication overheads is waferscale processing. However, waferscale processors [1], [2], [3] have been historically deemed impractical due to yield issues [1], [4] inherent to conventional integration technology. Emerging integration technologies such as Silicon-Interconnection Fabric (Si-IF) [5], [6], [7], where pre-manufactured dies are directly bonded on to a silicon wafer, may enable one to build a waferscale system without the corresponding yield issues. As such, waferscalar architectures need to be revisited. In this paper, we study if it is feasible and useful to build today’s architectures at waferscale. Using a waferscale GPU as a case study, we show that while a 300 mm wafer can house about 100 GPU modules (GPM), only a much scaled down GPU architecture with about 40 GPMs can be built when physical concerns are considered. We also study the performance and energy implications of waferscale architectures. We show that waferscale GPUs can provide significant performance and energy efficiency advantages (up to 18.9x speedup and 143x EDP benefit compared against equivalent MCM-GPU based implementation on PCB) without any change in the programming model. We also develop thread scheduling and data placement policies for waferscale GPU architectures. Our policies outperform state-of-art scheduling and data placement policies by up to 2.88x (average 1.4x) and 1.62x (average 1.11x) for 24 GPM and 40 GPM cases respectively. Finally, we build the first Si-IF prototype with interconnected dies. We observe 100% of the inter-die interconnects to be successfully connected in our prototype. Coupled with the high yield reported previously for bonding of dies on Si-IF, this demonstrates the technological readiness for building a waferscale GPU architecture. Keywords—Waferscale Processors, GPU, Silicon Interconnect Fabric","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"14 5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133177617","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
HyPar: Towards Hybrid Parallelism for Deep Learning Accelerator Array
Linghao Song, Jiachen Mao, Youwei Zhuo, Xuehai Qian, Hai Helen Li, Yiran Chen
doi: 10.1109/HPCA.2019.00027
With the rise of artificial intelligence in recent years, Deep Neural Networks (DNNs) have been widely used in many domains. To achieve high performance and energy efficiency, hardware acceleration of DNNs (especially of inference) is intensively studied both in academia and industry. However, we still face two challenges: large DNN models and datasets, which incur frequent off-chip memory accesses; and the training of DNNs, which is not well explored in recent accelerator designs. To truly provide high-throughput and energy-efficient acceleration for the training of deep and large models, we inevitably need multiple accelerators and must exploit the coarse-grain parallelism among them, beyond the fine-grain parallelism inside a layer that most existing architectures consider. This poses the key research question of finding the best organization of computation and dataflow among the accelerators. In this paper, inspired by recent work in machine learning systems, we propose HyPar, a solution that determines layer-wise parallelism for deep neural network training with an array of DNN accelerators. HyPar partitions the feature map tensors (input and output), the kernel tensors, the gradient tensors, and the error tensors across the DNN accelerators; a partition constitutes the choice of parallelism for the weighted layers. The optimization target is to find a partition that minimizes the total communication during the training of a complete DNN. To solve this problem, we propose a communication model that explains the source and amount of communication, and then use a hierarchical layer-wise dynamic programming method to search for the partition for each layer.
{"title":"HyPar: Towards Hybrid Parallelism for Deep Learning Accelerator Array","authors":"Linghao Song, Jiachen Mao, Youwei Zhuo, Xuehai Qian, Hai Helen Li, Yiran Chen","doi":"10.1109/HPCA.2019.00027","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00027","url":null,"abstract":"With the rise of artificial intelligence in recent years, Deep Neural Networks (DNNs) have been widely used in many domains. To achieve high performance and energy efficiency, hardware acceleration (especially inference) of DNNs is intensively studied both in academia and industry. However, we still face two challenges: large DNN models and datasets, which incur frequent off-chip memory accesses; and the training of DNNs, which is not well-explored in recent accelerator designs. To truly provide high throughput and energy efficient acceleration for the training of deep and large models, we inevitably need to use multiple accelerators to explore the coarse-grain parallelism, compared to the fine-grain parallelism inside a layer considered in most of the existing architectures. It poses the key research question to seek the best organization of computation and dataflow among accelerators. In this paper, inspired by recent work in machine learning systems, we propose a solution HyPar to determine layer-wise parallelism for deep neural network training with an array of DNN accelerators. HyPar partitions the feature map tensors (input and output), the kernel tensors, the gradient tensors, and the error tensors for the DNN accelerators. A partition constitutes the choice of parallelism for weighted layers. The optimization target is to search a partition that minimizes the total communication during training a complete DNN. To solve this problem, we propose a communication model to explain the source and amount of communications. Then, we use a hierarchical layer-wise dynamic programming method to search for the partition for each layer.","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124930496","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
E-RNN: Design Optimization for Efficient Recurrent Neural Networks in FPGAs
Zhe Li, Caiwen Ding, Siyue Wang, Wujie Wen, Youwei Zhuo, Chang Liu, Qinru Qiu, Wenyao Xu, X. Lin, Xuehai Qian, Yanzhi Wang
doi: 10.1109/HPCA.2019.00028
Recurrent Neural Networks (RNNs) are becoming increasingly important for time-series-related applications, which require efficient and real-time implementations. The two major types are Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks. Real-time, efficient, and accurate hardware RNN implementations are challenging because of the high sensitivity to imprecision accumulation and the need for special activation function implementations. A key limitation of prior work is the lack of a systematic design optimization framework spanning the RNN model and its hardware implementation, especially when the block size (or compression ratio) should be jointly optimized with the RNN type, layer size, etc. In this paper, we adopt the block-circulant matrix-based framework and present the Efficient RNN (E-RNN) framework for FPGA implementations of the Automatic Speech Recognition (ASR) application. The overall goal is to improve performance and energy efficiency under an accuracy requirement. We use the alternating direction method of multipliers (ADMM) technique for more accurate block-circulant training, and present two design explorations providing guidance on block size and on reducing RNN training trials. Based on these two observations, we decompose E-RNN into two phases: Phase I determines the RNN model to reduce computation and storage subject to the accuracy requirement, and Phase II covers the hardware implementation of the given RNN model, including processing element design/optimization, quantization, activation implementation, etc. Experimental results on actual FPGA deployments show that E-RNN achieves a maximum energy efficiency improvement of 37.4× compared with ESE, and more than 2× compared with C-LSTM, under the same accuracy.
{"title":"E-RNN: Design Optimization for Efficient Recurrent Neural Networks in FPGAs","authors":"Zhe Li, Caiwen Ding, Siyue Wang, Wujie Wen, Youwei Zhuo, Chang Liu, Qinru Qiu, Wenyao Xu, X. Lin, Xuehai Qian, Yanzhi Wang","doi":"10.1109/HPCA.2019.00028","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00028","url":null,"abstract":"Recurrent Neural Networks (RNNs) are becoming increasingly important for time series-related applications which require efficient and real-time implementations. The two major types are Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks. It is a challenging task to have real-time, efficient, and accurate hardware RNN implementations because of the high sensitivity to imprecision accumulation and the requirement of special activation function implementations. \u0000A key limitation of the prior works is the lack of a systematic design optimization framework of RNN model and hardware implementations, especially when the block size (or compression ratio) should be jointly optimized with RNN type, layer size, etc. In this paper, we adopt the block-circulant matrix-based framework, and present the Efficient RNN (E-RNN) framework for FPGA implementations of the Automatic Speech Recognition (ASR) application. The overall goal is to improve performance/energy efficiency under accuracy requirement. We use the alternating direction method of multipliers (ADMM) technique for more accurate block-circulant training, and present two design explorations providing guidance on block size and reducing RNN training trials. Based on the two observations, we decompose E-RNN in two phases: Phase I on determining RNN model to reduce computation and storage subject to accuracy requirement, and Phase II on hardware implementations given RNN model, including processing element design/optimization, quantization, activation implementation, etc. Experimental results on actual FPGA deployments show that E-RNN achieves a maximum energy efficiency improvement of 37.4$times$ compared with ESE, and more than 2$times$ compared with C-LSTM, under the same accuracy.","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"209 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133079629","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
R3-DLA (Reduce, Reuse, Recycle): A More Efficient Approach to Decoupled Look-Ahead Architectures
Sushant Kondguli, Michael C. Huang
doi: 10.1109/HPCA.2019.00064
Modern societies have developed insatiable demands for more computation capabilities. Exploiting implicit parallelism to provide automatic performance improvement remains a central goal in engineering future general-purpose computing systems. One approach is to use a separate thread context to perform continuous look-ahead, improving the data and instruction supply to the main pipeline. Such a decoupled look-ahead (DLA) architecture can be quite effective in accelerating a broad range of applications with a relatively straightforward implementation. It also offers broad design flexibility, as the look-ahead agent need not be concerned with correctness constraints. In this paper, we explore a number of optimizations that make the look-ahead agent more efficient while extracting more utility from it. With these optimizations, a DLA architecture can achieve an average speedup of 1.4x over a state-of-the-art microarchitecture for a broad set of benchmark suites, making it a powerful tool for enhancing single-thread performance.
{"title":"R3-DLA (Reduce, Reuse, Recycle): A More Efficient Approach to Decoupled Look-Ahead Architectures","authors":"Sushant Kondguli, Michael C. Huang","doi":"10.1109/HPCA.2019.00064","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00064","url":null,"abstract":"Modern societies have developed insatiable demands for more computation capabilities. Exploiting implicit parallelism to provide automatic performance improvement remains a central goal in engineering future general-purpose computing systems. One approach is to use a separate thread context to perform continuous look-ahead to improve the data and instruction supply to the main pipeline. Such a decoupled look-ahead (DLA) architecture can be quite effective in accelerating a broad range of applications in a relatively straightforward implementation. It also has broad design flexibility as the look-ahead agent need not be concerned with correctness constraints. In this paper, we explore a number of optimizations that make the look-ahead agent more efficient and yet extract more utility from it. With these optimizations, a DLA architecture can achieve an average speedup of 1.4 over a state-of-the-art microarchitecture for a broad set of benchmark suites, making it a powerful tool to enhance single-thread performance.","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116510289","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Message from the General Chairs
K. Saeed, A. Marasinghe
doi: 10.1109/HPCA.2019.00005
We are pleased to welcome you to the 26th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD 2014). SBAC-PAD will be held for the first time in Europe, at University Pierre et Marie Curie, Paris, France. This year, we have an outstanding program composed of 43 high-quality papers. Also, three highly distinguished researchers, Henri Bal (Vrije Universiteit, The Netherlands), William Blake (D-Wave Systems Inc, Canada), and John Goodacre (ARM, UK), will provide us with exciting keynote talks. In addition, we will have three associated international events: the 5th Workshop on Architecture and Multi-Core Applications (WAMCA), the Special Edition of the MPP workshop on data-programming models and machines, and the Workshop on Parallel and Distributed Computing for Big Data Applications (WPBA). We are honored to share with the MPP workshop the talks of Michael Flynn (Stanford, USA) and Arvind (MIT, USA).
We would like to thank the many people who contributed to making SBAC-PAD 2014 a success. First of all, we would like to thank Alfredo Goldman (University of Sao Paulo, Brazil) and Laxmikant Kale (University of Illinois at Urbana-Champaign, USA), the Program Chairs, as well as the Track Chairs and the Program Committees, for their splendid work in selecting the papers. We also would like to thank and congratulate the authors for their successful efforts. The help of the members of the Steering Committee in solving problems that arose during the conference organization was most appreciated, and crucial help came from our colleagues on the Organizing Committee; thank you all. We also would like to express our gratitude to our sponsors: the Brazilian Computer Society (SBC), the IEEE Computer Society, Inria, CNRS, and the LIP6 lab, as well as our industrial sponsors Bull, Maxeler, and Nvidia.
It has been a pleasure and an honor to cooperate with the above-mentioned people and the many others who have supported our activities to make this event successful. We wish you a great conference and a wonderful stay in Paris.
{"title":"Message from the General Chairs","authors":"K. Saeed, A. Marasinghe","doi":"10.1109/hpca.2019.00005","DOIUrl":"https://doi.org/10.1109/hpca.2019.00005","url":null,"abstract":"We are pleased to welcome you to the 26th International Symposium on Computer Architecture and High Performance Computing SBAC-PAD 2014. SBACPAD will be held for the first time in Europe at University Pierre et Marie Curie, Paris, France. This year, we have an outstanding program composed of 43 high quality papers. Also, three highly distinguished researchers Henri Bal (Vrije Universiteit, The Netherlands), William Blake (D-Wave Systems Inc, Canada), and John Goodacre (ARM, UK) will provide us with exciting keynote talks. In addition, we will have three associated international events: the 5th Workshop on Architecture and Multi-Core Applications (WAMCA), the Special Edition of MPP workshop on Data-programming models and machines, and Workshop on Parallel and Distributed Computing for Big Data Applications (WPBA). We are honored to share with MPP workshop, the talk of Michael Flynn (Stanford, USA) and Arvind (MIT, USA). We would like to thank the many people who contributed to make the SBACPAD 2014 a success. First of all, we would like to thank Alfredo Goldman (University of Sao Paulo, Brazil) and Laxmikant Kale (University of Illinois at Urbana-Champaign, USA) the Program Chairs, the Track Chairs and the Program Committees for their splendid work in selecting the papers. We also would like to thank and congratulate the authors for their successful efforts. The help of the members of the Steering Committee in solving problems that arise during the conference organization was most appreciated. Crucial help came from our Colleagues of the Organizing Committee; thank you all. We also would like to express our gratitude to our sponsors: the Brazilian Computer Society (SBC), the IEEE Computer Society, Inria, CNRS and LIP6 lab. and our industrial sponsors Bull, Maxeler, and Nvidia. It has been a pleasure and honor to cooperate with the above mentioned people and many others who have supported our activities to make this event successful. We wish you a great conference and a wonderful stay in Paris.","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128406806","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
D-RaNGe: Using Commodity DRAM Devices to Generate True Random Numbers with Low Latency and High Throughput
Jeremie S. Kim, Minesh Patel, Hasan Hassan, Lois Orosa, O. Mutlu
doi: 10.1109/HPCA.2019.00011
We propose a new DRAM-based true random number generator (TRNG) that leverages DRAM cells as an entropy source. The key idea is to intentionally violate the DRAM access timing parameters and use the resulting errors as the source of randomness. Our technique specifically decreases the DRAM row activation latency (timing parameter tRCD) below manufacturer-recommended specifications to induce read errors, or activation failures, that exhibit true random behavior. We then aggregate the resulting data from multiple cells to obtain a TRNG capable of providing a high throughput of random numbers at low latency. To demonstrate that our TRNG design is viable using commodity DRAM chips, we rigorously characterize the behavior of activation failures in 282 state-of-the-art LPDDR4 devices from three major DRAM manufacturers. We verify our observations using four additional DDR3 DRAM devices from the same manufacturers. Our results show that many cells in each device produce random data that remains robust over both time and temperature variation. We use our observations to develop D-RaNGe, a methodology for extracting true random numbers from commodity DRAM devices with high throughput and low latency by deliberately violating the read access timing parameters. We evaluate the quality of our TRNG using the commonly used NIST statistical test suite for randomness and find that D-RaNGe 1) successfully passes each test, and 2) generates true random numbers with over two orders of magnitude higher throughput than the previous highest-throughput DRAM-based TRNG.
{"title":"D-RaNGe: Using Commodity DRAM Devices to Generate True Random Numbers with Low Latency and High Throughput","authors":"Jeremie S. Kim, Minesh Patel, Hasan Hassan, Lois Orosa, O. Mutlu","doi":"10.1109/HPCA.2019.00011","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00011","url":null,"abstract":"We propose a new DRAM-based true random number generator (TRNG) that leverages DRAM cells as an entropy source. The key idea is to intentionally violate the DRAM access timing parameters and use the resulting errors as the source of randomness. Our technique specifically decreases the DRAM row activation latency (timing parameter tRCD) below manufacturer-recommended specifications, to induce read errors, or activation failures, that exhibit true random behavior. We then aggregate the resulting data from multiple cells to obtain a TRNG capable of providing a high throughput of random numbers at low latency. \u0000To demonstrate that our TRNG design is viable using commodity DRAM chips, we rigorously characterize the behavior of activation failures in 282 state-of-the-art LPDDR4 devices from three major DRAM manufacturers. We verify our observations using four additional DDR3 DRAM devices from the same manufacturers. Our results show that many cells in each device produce random data that remains robust over both time and temperature variation. We use our observations to develop D-RanGe, a methodology for extracting true random numbers from commodity DRAM devices with high throughput and low latency by deliberately violating the read access timing parameters. We evaluate the quality of our TRNG using the commonly-used NIST statistical test suite for randomness and find that D-RaNGe: 1) successfully passes each test, and 2) generates true random numbers with over two orders of magnitude higher throughput than the previous highest-throughput DRAM-based TRNG.","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131100019","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rendering Elimination: Early Discard of Redundant Tiles in the Graphics Pipeline
Martí Anglada, Enrique de Lucas, Joan-Manuel Parcerisa, Juan L. Aragón, P. Marcuello, Antonio González
doi: 10.1109/HPCA.2019.00014
GPUs are one of the most energy-consuming components for real-time rendering applications, since a large number of fragment shading computations and memory accesses are involved. Main memory bandwidth especially taxes battery-operated devices such as smartphones. Tile-Based Rendering GPUs divide the screen space into multiple tiles that are independently rendered in on-chip buffers, thus reducing memory bandwidth and energy consumption. We have observed that, in many animated graphics workloads, a large number of screen tiles have the same color across adjacent frames. In this paper, we propose Rendering Elimination (RE), a novel microarchitectural technique that accurately determines, before rasterization, whether a tile will be identical to the same tile in the preceding frame, by comparing signatures. Since RE identifies redundant tiles early in the graphics pipeline, it completely avoids the computation and memory accesses of the most power-consuming stages of the pipeline, which substantially reduces the execution time and the energy consumption of the GPU. For widely used Android applications, we show that RE achieves an average speedup of 1.74x and an energy reduction of 43% for the GPU/memory system, surpassing by far the benefits of Transaction Elimination, a state-of-the-art memory bandwidth reduction technique available in some commercial Tile-Based Rendering GPUs.
{"title":"Rendering Elimination: Early Discard of Redundant Tiles in the Graphics Pipeline","authors":"Martí Anglada, Enrique de Lucas, Joan-Manuel Parcerisa, Juan L. Aragón, P. Marcuello, Antonio González","doi":"10.1109/HPCA.2019.00014","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00014","url":null,"abstract":"GPUs are one of the most energy-consuming components for real-time rendering applications, since a large number of fragment shading computations and memory accesses are involved. Main memory bandwidth is especially taxing battery-operated devices such as smartphones. Tile-Based Rendering GPUs divide the screen space into multiple tiles that are independently rendered in on-chip buffers, thus reducing memory bandwidth and energy consumption. We have observed that, in many animated graphics workloads, a large number of screen tiles have the same color across adjacent frames. In this paper, we propose Rendering Elimination (RE), a novel micro-architectural technique that accurately determines if a tile will be identical to the same tile in the preceding frame before rasterization by means of comparing signatures. Since RE identifies redundant tiles early in the graphics pipeline, it completely avoids the computation and memory accesses of the most power consuming stages of the pipeline, which substantially reduces the execution time and the energy consumption of the GPU. For widely used Android applications, we show that RE achieves an average speedup of 1.74x and energy reduction of 43% for the GPU/Memory system, surpassing by far the benefits of Transaction Elimination, a state-of-the-art memory bandwidth reduction technique available in some commercial Tile-Based Rendering GPUs.","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125726971","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pliant: Leveraging Approximation to Improve Datacenter Resource Efficiency
Neeraj Kulkarni, Feng Qi, Christina Delimitrou
doi: 10.1109/HPCA.2019.00035
Cloud multi-tenancy is typically constrained to a single interactive service colocated with one or more batch, low-priority services, whose performance can be sacrificed when deemed necessary. Approximate computing applications offer the opportunity to enable tighter colocation among multiple applications whose performance is important. We present Pliant, a lightweight cloud runtime that leverages the ability of approximate computing applications to tolerate some loss in output quality to boost the utilization of shared servers. During periods of high resource contention, Pliant employs incremental and interference-aware approximation to reduce contention in shared resources and prevent QoS violations for co-scheduled interactive, latency-critical services. We evaluate Pliant across different interactive and approximate computing applications, and show that it preserves QoS for all co-scheduled workloads while incurring a 2.1% loss in output quality, on average.
{"title":"Pliant: Leveraging Approximation to Improve Datacenter Resource Efficiency","authors":"Neeraj Kulkarni, Feng Qi, Christina Delimitrou","doi":"10.1109/HPCA.2019.00035","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00035","url":null,"abstract":"Cloud multi-tenancy is typically constrained to a single interactive service colocated with one or more batch, low-priority services, whose performance can be sacrificed when deemed necessary. Approximate computing applications offer the opportunity to enable tighter colocation among multiple applications whose performance is important. We present Pliant, a lightweight cloud runtime that leverages the ability of approximate computing applications to tolerate some loss in their output quality to boost the utilization of shared servers. During periods of high resource contention, Pliant employs incremental and interference-aware approximation to reduce contention in shared resources, and prevent QoS violations for co-scheduled interactive, latency-critical services. We evaluate Pliant across different interactive and approximate computing applications, and show that it preserves QoS for all co-scheduled workloads, while incurring a 2.1% loss in output quality, on average.","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-04-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130871068","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}