Scaling up Network Centrality Computations
Pub Date: 2019-03-25; DOI: 10.23919/DATE.2019.8714773
Alexander van der Grinten, Henning Meyerhenke
2019 Design, Automation & Test in Europe Conference & Exhibition (DATE)
Network science methodology is increasingly applied to a large variety of real-world phenomena. Thus, network data sets with millions or billions of edges are increasingly common. To process and analyze such graphs, we need appropriate graph processing systems and fast algorithms. Many analysis algorithms were pioneered, however, on small networks, when speed was not the highest concern. Developing an analysis toolkit for large-scale networks thus often requires faster variants, both from an algorithmic and an implementation perspective. In this paper we focus on computational aspects of vertex centrality measures. Such measures indicate the importance of a vertex based on the position of the vertex in the network. We describe several common measures as well as algorithms for computing them. The description has two foci: (i) our recent contributions to the field and (ii) possible future work, particularly regarding lower-level implementation.

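
The abstract does not name the measures it covers, so the following is only an illustrative sketch of one common vertex centrality, closeness, computed with one breadth-first search per vertex on an unweighted graph. The adjacency-dict representation and the function names are assumptions made for this example, not taken from the paper.

```python
from collections import deque


def bfs_distances(adj, source):
    """Return hop distances from source to every vertex reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def closeness(adj):
    """Closeness of v = (number of other reached vertices) / (sum of distances)."""
    scores = {}
    for v in adj:
        dist = bfs_distances(adj, v)
        total = sum(dist.values())
        # An isolated vertex reaches nothing; give it a score of 0.0.
        scores[v] = (len(dist) - 1) / total if total > 0 else 0.0
    return scores


# Path graph 0 - 1 - 2 - 3: the two inner vertices are more central.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
scores = closeness(path)
```

Running one BFS per vertex costs O(n(n + m)) time on a graph with n vertices and m edges, which is the kind of cost that becomes prohibitive at millions or billions of edges and motivates the faster variants the abstract refers to.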
Hardware-Accelerated Energy-Efficient Synchronization and Communication for Ultra-Low-Power Tightly Coupled Clusters
Pub Date: 2019-03-25; DOI: 10.23919/DATE.2019.8715266
Florian Glaser, Germain Haugou, D. Rossi, Qiuting Huang, L. Benini
2019 Design, Automation & Test in Europe Conference & Exhibition (DATE)
Parallel ultra-low-power computing is emerging as an enabler to meet the growing performance and energy efficiency demands in deeply embedded systems such as the end-nodes of the internet-of-things (IoT). The parallel nature of these systems, however, adds a significant degree of complexity, as processing elements (PEs) need to communicate in various ways to organize and synchronize execution. Naive implementations of these central and non-trivial mechanisms can quickly jeopardize overall system performance and limit the achievable speedup and energy efficiency. To avoid this bottleneck, we present an event-based solution centered around a technology-independent, light-weight and scalable (up to 16 cores) synchronization and communication unit (SCU) and its integration into a shared-memory multicore cluster. Careful design and tight coupling of the SCU to the data interfaces of the cores allow common synchronization procedures to be executed with a single instruction. Furthermore, we present hardware support for the common barrier and lock synchronization primitives with a barrier latency of only eleven cycles, independent of the number of involved cores. We demonstrate the efficiency of the solution based on experiments with a post-layout implementation of the multicore cluster in a 22 nm CMOS process, where the SCU constitutes less than 2% of area overhead. Our solution supports parallel sections as small as 100 or 72 cycles with a synchronization overhead of just 10%, an improvement of up to 14× or 30× with respect to cycle count or energy, respectively, compared to a test-and-set-based implementation.

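
For orientation, the software baseline mentioned at the end of the abstract can be pictured as a barrier built on a lock plus busy-waiting. The sketch below is a generic generation-counting spin barrier in Python, with `threading.Lock` standing in for a test-and-set lock; the class and helper names are hypothetical, and it is not the baseline code used in the paper.

```python
import threading
import time


class SpinBarrier:
    """Software barrier: a lock-protected counter plus busy-waiting on a generation."""

    def __init__(self, num_threads):
        self.num_threads = num_threads
        self.count = 0
        self.generation = 0
        self.lock = threading.Lock()  # stand-in for a test-and-set lock

    def wait(self):
        with self.lock:
            my_generation = self.generation
            self.count += 1
            if self.count == self.num_threads:
                # Last arrival resets the counter and releases the others.
                self.count = 0
                self.generation += 1
                return
        # Everyone else polls shared state until the generation changes.
        while self.generation == my_generation:
            time.sleep(0)  # yield so other threads can make progress


def run_phases(num_threads, num_phases):
    """Run workers that log their phase number, separated by barriers."""
    barrier = SpinBarrier(num_threads)
    log = []
    log_lock = threading.Lock()

    def worker():
        for phase in range(num_phases):
            with log_lock:
                log.append(phase)
            barrier.wait()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

Every arrival serializes on the lock and every waiter polls shared state, so the cost of this style of barrier tends to grow with the number of threads. The abstract contrasts this with a hardware barrier whose latency is independent of the number of cores.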
Coherently Attached Programmable Near-Memory Acceleration Platform and its application to Stencil Processing
Pub Date: 2019-03-25; DOI: 10.23919/DATE.2019.8715088
J. V. Lunteren, R. Luijten, D. Diamantopoulos, F. Auernhammer, C. Hagleitner, Lorenzo Chelini, Stefano Corda, Gagandeep Singh
2019 Design, Automation & Test in Europe Conference & Exhibition (DATE)
Application and technology trends are increasingly forcing computer systems to be designed for specific workloads and application domains. Although memory is one of the key components impacting the performance and power consumption of state-of-the-art computer systems, its operation typically cannot be adapted to workload characteristics beyond some limited controller configuration options. In this paper, we present a novel near-memory acceleration platform based on an Access Processor that enables the main memory system operation to be programmed and adapted dynamically to the accelerated workload. The platform targets both ASIC and FPGA implementations integrated within IBM POWER systems. We show how this platform can be applied to accelerate stencil processing.

Inkjet-Printed True Random Number Generator based on Additive Resistor Tuning
Pub Date: 2019-03-25; DOI: 10.23919/DATE.2019.8715071
Ahmet Turan Erozan, R. Bishnoi, J. Aghassi‐Hagmann, M. Tahoori
2019 Design, Automation & Test in Europe Conference & Exhibition (DATE)
Printed electronics (PE) is a fast-growing technology with promising applications in wearables, smart sensors and smart cards, since it provides mechanical flexibility and low-cost, on-demand and customizable fabrication. To secure the operation of these applications, True Random Number Generators (TRNGs) are required to generate unpredictable bits for cryptographic functions and padding. However, since the additive fabrication process of PE circuits results in high intrinsic variation due to the random dispersion of the printed inks on the substrate, constructing a printed TRNG is challenging. In this paper, we exploit the additive customizable fabrication feature of inkjet printing to design a TRNG based on electrolyte-gated field effect transistors (EGFETs). The proposed memory-based TRNG circuit can operate at low voltages (≤ 1 V) and is hence suitable for low-power applications. We also propose a flow which tunes the printed resistors of the TRNG circuit to mitigate the overall process variation of the TRNG so that the generated bits are mostly based on the random noise in the circuit, providing true random behaviour. The results show that the overall process variation of the TRNGs is mitigated by a factor of 110, and the simulated TRNGs pass the National Institute of Standards and Technology Statistical Test Suite.

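
To give a sense of what passing the NIST suite involves, here is a minimal sketch of its simplest test, the frequency (monobit) test from NIST SP 800-22, which checks whether ones and zeros are roughly balanced. The function name and the example bit sequences are made up for illustration, and nothing here reproduces the paper's evaluation.

```python
import math


def monobit_p_value(bits):
    """Frequency (monobit) test: p-value for 'ones and zeros are balanced'."""
    n = len(bits)
    if n == 0:
        raise ValueError("need at least one bit")
    # Map 1 -> +1 and 0 -> -1, then sum.
    s = sum(1 if b else -1 for b in bits)
    s_obs = abs(s) / math.sqrt(n)
    return math.erfc(s_obs / math.sqrt(2))


# SP 800-22 treats a sequence as passing a test when the p-value is >= 0.01.
balanced = [0, 1] * 50   # perfectly balanced (though obviously not random)
stuck_at_one = [1] * 100  # heavily biased
p_balanced = monobit_p_value(balanced)
p_stuck = monobit_p_value(stuck_at_one)
```

The alternating sequence passes this test despite being fully predictable, which is why the suite combines many tests (runs, block frequency, and others) rather than relying on bit balance alone.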
IBM’s Qiskit Tool Chain: Working with and Developing for Real Quantum Computers
Pub Date: 2019-03-25; DOI: 10.23919/DATE.2019.8715261
R. Wille, R. V. Meter, Y. Naveh
2019 Design, Automation & Test in Europe Conference & Exhibition (DATE)
Quantum computers promise substantial speedups over conventional machines for many practical applications. While they were considered "dreams of the future" for a long time, the first quantum computers are now available and can be used by anyone. A leading force within this development is IBM Research, which launched the IBM Q Experience, the first industrial initiative to build universal quantum computers and make them accessible to a broad audience through cloud access. Alongside this initiative, the tool Qiskit has been launched, which enables researchers, teachers, developers, and general enthusiasts to write corresponding code and to run experiments on those machines. At the same time, this provides an ideal playground for the design automation community, which, through Qiskit, can deploy improved solutions, e.g., for designing and realizing quantum applications. This special session summary provides an introduction to Qiskit and showcases selected success stories on how to work with and develop for it. In addition, it provides references to further reading in the form of tutorials and scientific papers, as well as links to publicly available implementations of Qiskit extensions.

Cache-Aware Kernel Tiling: An Approach for System-Level Performance Optimization of GPU-Based Applications
Pub Date: 2019-03-25; DOI: 10.23919/DATE.2019.8714861
Arian Maghazeh, Sudipta Chattopadhyay, P. Eles, Zebo Peng
2019 Design, Automation & Test in Europe Conference & Exhibition (DATE)
We present a software approach to address the data latency issue for certain GPU applications. Each application is modeled as a kernel graph, where the nodes represent individual GPU kernels and the edges capture data dependencies. Our technique exploits the GPU L2 cache to accelerate parameter passing between the kernels. The key idea is that, instead of having each kernel process the entire input in one invocation, we subdivide the input into fragments (which fit in the cache) and, ideally, process each fragment in one continuous sequence of kernel invocations. Our proposed technique is oblivious to kernel functionalities and requires minimal source code modification. We demonstrate our technique on a full-fledged image processing application and improve the performance on average by 30% over various settings.

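
The change in execution order described in the abstract can be sketched on the host side as follows. This is only an illustration under strong simplifying assumptions: the kernel graph is a plain linear chain, the "kernels" are element-wise Python functions (so fragments need no overlap), and `fragment_size` is a hypothetical parameter standing in for "fits in the L2 cache". It shows the reordering only, not any cache effect.

```python
def run_untiled(kernels, data):
    """Baseline order: each kernel processes the entire input in one invocation."""
    for kernel in kernels:
        data = kernel(data)
    return data


def run_tiled(kernels, data, fragment_size):
    """Tiled order: each fragment runs through the whole kernel chain in sequence."""
    if fragment_size <= 0:
        raise ValueError("fragment_size must be positive")
    result = []
    for start in range(0, len(data), fragment_size):
        fragment = data[start:start + fragment_size]
        for kernel in kernels:
            # The intermediate fragment is small, so on a GPU it could
            # stay cache-resident between consecutive kernels.
            fragment = kernel(fragment)
        result.extend(fragment)
    return result


# Hypothetical element-wise stages standing in for GPU kernels.
def scale(xs):
    return [2 * x for x in xs]


def offset(xs):
    return [x + 1 for x in xs]


def clamp(xs):
    return [min(x, 10) for x in xs]


pixels = list(range(8))
chain = [scale, offset, clamp]
tiled = run_tiled(chain, pixels, fragment_size=3)
untiled = run_untiled(chain, pixels)
```

Both orders produce the same output here because every stage is element-wise. Kernels that read neighbouring elements would need overlapping fragments or other handling, which this sketch does not attempt.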
Characterizing the Reliability and Threshold Voltage Shifting of 3D Charge Trap NAND Flash
Pub Date: 2019-03-25; DOI: 10.23919/DATE.2019.8714941
Weihua Liu, Fei Wu, Meng Zhang, Yifei Wang, Zhonghai Lu, Xiangfeng Lu, C. Xie
2019 Design, Automation & Test in Europe Conference & Exhibition (DATE)
3D charge trap (CT) triple-level cell (TLC) NAND flash is gradually becoming a mainstream storage component due to its high storage capacity and performance, but it introduces concerns about reliability. Fault tolerance and data management schemes are capable of improving reliability. Designing a more efficient solution, however, requires understanding the reliability characteristics of 3D CT TLC NAND flash. To facilitate such understanding, we use a real-world testing platform to investigate the reliability characteristics, including the raw bit error rate (RBER) and the threshold voltage (Vth) shifting features, after the flash is subjected to variable disturbances. We analyze why these characteristics exist in 3D CT TLC NAND flash. We hope these observations can guide designers in proposing highly efficient solutions to the reliability problem.

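
As a reminder of what the RBER metric measures, the sketch below computes the fraction of bits that differ between the data written to a page and the data read back before any error correction. The function name and the two-byte example are illustrative assumptions, not measurements or tooling from the paper.

```python
def raw_bit_error_rate(written: bytes, read_back: bytes) -> float:
    """Fraction of bits that differ between written and raw read-back data."""
    if len(written) != len(read_back):
        raise ValueError("buffers must have the same length")
    if not written:
        raise ValueError("buffers must not be empty")
    # XOR exposes flipped bits; count the ones in each byte.
    flipped = sum(bin(w ^ r).count("1") for w, r in zip(written, read_back))
    return flipped / (8 * len(written))


# Two flipped bits out of sixteen.
written = bytes([0b11110000, 0b00000000])
read_back = bytes([0b11110001, 0b10000000])
rber = raw_bit_error_rate(written, read_back)
```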
Visual Inertial Odometry At the Edge: A Hardware-Software Co-design Approach for Ultra-low Latency and Power
Pub Date: 2019-03-25; DOI: 10.23919/DATE.2019.8714921
D. Mandal, S. Jandhyala, O. J. Omer, G. Kalsi, Biji George, G. Neela, S. Rethinagiri, S. Subramoney, Lance Hacking, J. Radford, E. Jones, B. Kuttanna, Hong Wang
2019 Design, Automation & Test in Europe Conference & Exhibition (DATE)
Visual Inertial Odometry (VIO) is used for estimating the pose and trajectory of a system and is a foundational requirement in many emerging applications such as AR/VR and autonomous navigation in cars, drones and robots. In this paper, we analyze key compute bottlenecks in VIO and present a highly optimized VIO accelerator based on a hardware-software co-design approach. We detail a set of novel micro-architectural techniques that optimize compute, data movement, bandwidth and dynamic power to make it possible to deliver high-quality VIO at the ultra-low latency and power required for budget-constrained edge devices. By offloading the computation of the critical linear algebra algorithms from the CPU, the accelerator enables high-sample-rate IMU usage in VIO processing, while acceleration of the image processing pipeline increases precision and robustness and reduces IMU-induced drift in the final pose estimate. The proposed accelerator requires a small silicon footprint (1.3 mm² in a 28 nm process at 600 MHz), utilizes a modest on-chip shared SRAM (560 KB) and achieves a 10× speedup over a software-only implementation in terms of image-sample-based pose update latency while consuming just 2.2 mW of power. In an FPGA implementation, using the EuRoC VIO dataset (VGA 30 fps images and 100 Hz IMU), the accelerator design achieves pose estimation accuracy (loop closure error) comparable to a software-based VIO implementation.

The Case for Exploiting Underutilized Resources in Heterogeneous Mobile Architectures
Pub Date: 2019-03-25; DOI: 10.23919/DATE.2019.8714970
Chen-Ying Hsieh, A. A. Sani, N. Dutt
2019 Design, Automation & Test in Europe Conference & Exhibition (DATE)
Heterogeneous architectures are ubiquitous in mobile platforms, with mobile SoCs typically integrating multiple processors along with accelerators such as GPUs (for data-parallel kernels) and DSPs (for signal processing kernels). This strict partitioning of application execution on heterogeneous compute resources often results in underutilization of resources such as DSPs. We present a case study executing a mix of popular data-parallel workloads such as convolutional neural networks (CNNs), computer vision filters and graphics rendering kernels on mobile devices, and show that both performance and energy consumption of mobile platforms can be improved by synergistically deploying these underutilized compute resources. Our experiments on a mobile Snapdragon 835 platform, under both single and multiple application scenarios executing the aforementioned workloads, demonstrate average performance and energy improvements of 15-46% and 18-80%, respectively, by synergistically deploying all available compute resources, especially the underutilized DSP.

LoSCache: Leveraging Locality Similarity to Build Energy-Efficient GPU L2 Cache
Pub Date: 2019-03-25; DOI: 10.23919/DATE.2019.8714911
Jingweijia Tan, Kaige Yan, S. Song, Xin Fu
2019 Design, Automation & Test in Europe Conference & Exhibition (DATE)
This paper presents a novel energy-efficient cache design for massively parallel, throughput-oriented architectures like GPUs. Unlike the L1 data cache on modern GPUs, the L2 cache shared by all the streaming multiprocessors is not the primary performance bottleneck, but it does consume a large amount of chip energy. We observe that the L2 cache is significantly under-utilized, spending 95.6% of the time storing useless data. If such "dead time" on L2 is identified and reduced, L2’s energy efficiency can be drastically improved. Fortunately, we discover that the SIMT programming model of GPUs provides a unique feature among threads: instruction-level data locality similarity, which can be used to accurately predict the data re-reference counts at the L2 cache block level. We propose a simple design that leverages this Locality Similarity to build an energy-efficient GPU L2 Cache, named LoSCache. Specifically, LoSCache uses the data locality information from a small group of CTAs to dynamically predict the L2-level data re-reference counts of the remaining CTAs. After that, specific L2 cache lines can be powered off if they are predicted to be "dead" after certain accesses. Experimental results on a wide range of applications demonstrate that our proposed design can significantly reduce the L2 cache energy by an average of 64% with only 0.5% performance loss.

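
The abstract gives only the outline of the prediction scheme, so the following toy model is a loose illustration of the general idea rather than the LoSCache design. It assumes accesses are given as (instruction address, block) pairs, attributes each block to the instruction that first touched it, learns a reference count per instruction from a sampled trace, and then flags blocks in a later trace as dead once that count is reached. All names and the conservative "take the maximum" rule are assumptions of this sketch.

```python
from collections import defaultdict


def learn_reference_counts(sample_trace):
    """From sampled (pc, block) accesses, learn pc -> expected accesses per block."""
    first_pc = {}            # block -> instruction that first touched it
    refs = defaultdict(int)  # block -> total accesses observed
    for pc, block in sample_trace:
        first_pc.setdefault(block, pc)
        refs[block] += 1
    per_pc = defaultdict(list)
    for block, pc in first_pc.items():
        per_pc[pc].append(refs[block])
    # Conservative choice: predict the largest count seen for each instruction.
    return {pc: max(counts) for pc, counts in per_pc.items()}


def predict_dead_points(trace, predicted):
    """Return (access index, block) pairs at which a block is predicted dead."""
    owner = {}
    seen = defaultdict(int)
    dead = []
    for index, (pc, block) in enumerate(trace):
        owner.setdefault(block, pc)
        seen[block] += 1
        # Blocks owned by instructions without a prediction are never flagged.
        if seen[block] == predicted.get(owner[block]):
            dead.append((index, block))
    return dead


# Sampled group: blocks brought in by instruction 0x10 are each touched twice.
sample = [(0x10, "A"), (0x20, "A"), (0x10, "B"), (0x20, "B")]
predicted = learn_reference_counts(sample)

# Remaining work follows the same pattern on different blocks.
rest = [(0x10, "C"), (0x10, "D"), (0x20, "C"), (0x20, "D")]
dead_points = predict_dead_points(rest, predicted)
```

In this toy run, blocks C and D are flagged right after their second access, which is the point at which a real design could consider powering the corresponding line off.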