2019 Design, Automation & Test in Europe Conference & Exhibition (DATE): Latest Publications


Scaling up Network Centrality Computations

2019 Design, Automation & Test in Europe Conference & Exhibition (DATE)

Pub Date : 2019-03-25 DOI: 10.23919/DATE.2019.8714773

Alexander van der Grinten, Henning Meyerhenke

Network science methodology is increasingly applied to a large variety of real-world phenomena. Thus, network data sets with millions or billions of edges are more and more common. To process and analyze such graphs, we need appropriate graph processing systems and fast algorithms. Many analysis algorithms have been pioneered, however, on small networks when speed was not the highest concern. Developing an analysis toolkit for large-scale networks thus often requires faster variants, both from an algorithmic and an implementation perspective. In this paper we focus on computational aspects of vertex centrality measures. Such measures indicate the importance of a vertex based on the position of the vertex in the network. We describe several common measures as well as algorithms for computing them. The description has two foci: (i) our recent contributions to the field and (ii) possible future work, particularly regarding lower-level implementation.
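To make the notion of a vertex centrality measure concrete, here is a minimal sketch of closeness centrality on an unweighted graph. It is the textbook exact method (one breadth-first search per vertex), not one of the faster variants the abstract refers to; the function names and the adjacency-dictionary representation are assumptions made for illustration.

```python
from collections import deque


def bfs_distances(adj, source):
    """Return hop distances from source to every vertex reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def closeness_centrality(adj):
    """Closeness of v = (reachable vertices - 1) / (sum of distances from v)."""
    scores = {}
    for v in adj:
        dist = bfs_distances(adj, v)
        total = sum(dist.values())
        # An isolated vertex reaches nothing, so give it a score of 0.
        scores[v] = (len(dist) - 1) / total if total > 0 else 0.0
    return scores


# Path graph a - b - c: the middle vertex is the most central.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
scores = closeness_centrality(path)
```

Because this runs one full traversal per vertex, its cost grows with the number of vertices times the size of the graph, which illustrates why exact computation becomes expensive on networks with millions or billions of edges.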


Citations: 2

Hardware-Accelerated Energy-Efficient Synchronization and Communication for Ultra-Low-Power Tightly Coupled Clusters

2019 Design, Automation & Test in Europe Conference & Exhibition (DATE)

Pub Date : 2019-03-25 DOI: 10.23919/DATE.2019.8715266

Florian Glaser, Germain Haugou, D. Rossi, Qiuting Huang, L. Benini

Parallel ultra-low-power computing is emerging as an enabler to meet the growing performance and energy efficiency demands in deeply embedded systems such as the end-nodes of the internet-of-things (IoT). The parallel nature of these systems, however, adds a significant degree of complexity, as processing elements (PEs) need to communicate in various ways to organize and synchronize execution. Naive implementations of these central and non-trivial mechanisms can quickly jeopardize overall system performance and limit the achievable speedup and energy efficiency. To avoid this bottleneck, we present an event-based solution centered around a technology-independent, light-weight and scalable (up to 16 cores) synchronization and communication unit (SCU) and its integration into a shared-memory multicore cluster. Careful design and tight coupling of the SCU to the data interfaces of the cores allow common synchronization procedures to be executed with a single instruction. Furthermore, we present hardware support for the common barrier and lock synchronization primitives with a barrier latency of only eleven cycles, independent of the number of involved cores. We demonstrate the efficiency of the solution based on experiments with a post-layout implementation of the multicore cluster in a 22 nm CMOS process where the SCU constitutes less than 2 % of area overhead. Our solution supports parallel sections as small as 100 or 72 cycles with a synchronization overhead of just 10 %, an improvement of up to 14× or 30× with respect to cycle count or energy, respectively, compared to a test-and-set based implementation.


Citations: 6

Coherently Attached Programmable Near-Memory Acceleration Platform and its application to Stencil Processing

2019 Design, Automation & Test in Europe Conference & Exhibition (DATE)

Pub Date : 2019-03-25 DOI: 10.23919/DATE.2019.8715088

J. V. Lunteren, R. Luijten, D. Diamantopoulos, F. Auernhammer, C. Hagleitner, Lorenzo Chelini, Stefano Corda, Gagandeep Singh

Application and technology trends are increasingly forcing computer systems to be designed for specific workloads and application domains. Although memory is one of the key components impacting the performance and power consumption of state-of-the-art computer systems, its operation typically cannot be adapted to workload characteristics beyond some limited controller configuration options. In this paper, we present a novel near-memory acceleration platform based on an Access Processor that enables the main memory system operation to be programmed and adapted dynamically to the accelerated workload. The platform targets both ASIC and FPGA implementations integrated within IBM POWER systems. We show how this platform can be applied to accelerate stencil processing.


Citations: 9

Inkjet-Printed True Random Number Generator based on Additive Resistor Tuning

2019 Design, Automation & Test in Europe Conference & Exhibition (DATE)

Pub Date : 2019-03-25 DOI: 10.23919/DATE.2019.8715071

Ahmet Turan Erozan, R. Bishnoi, J. Aghassi‐Hagmann, M. Tahoori

Printed electronics (PE) is a fast-growing technology with promising applications in wearables, smart sensors and smart cards since it provides mechanical flexibility and low-cost, on-demand and customizable fabrication. To secure the operation of these applications, True Random Number Generators (TRNGs) are required to generate unpredictable bits for cryptographic functions and padding. However, since the additive fabrication process of PE circuits results in high intrinsic variation due to the random dispersion of the printed inks on the substrate, constructing a printed TRNG is challenging. In this paper, we exploit the additive customizable fabrication feature of inkjet printing to design a TRNG based on electrolyte-gated field effect transistors (EGFETs). The proposed memory-based TRNG circuit can operate at low voltages (≤ 1 V) and is hence suitable for low-power applications. We also propose a flow which tunes the printed resistors of the TRNG circuit to mitigate the overall process variation of the TRNG so that the generated bits are mostly based on the random noise in the circuit, providing a true random behaviour. The results show that the overall process variation of the TRNGs is mitigated by 110 times, and the simulated TRNGs pass the National Institute of Standards and Technology Statistical Test Suite.
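As an illustration of what passing a statistical test means for a bit stream, the sketch below implements the frequency (monobit) test, one of the tests in the NIST statistical test suite (SP 800-22). It checks only whether zeros and ones are roughly balanced; the function name and the sample inputs are assumptions for illustration, and the abstract does not say how the authors ran the suite.

```python
import math


def monobit_p_value(bits):
    """P-value of the frequency (monobit) test for a sequence of 0/1 bits."""
    n = len(bits)
    if n == 0:
        raise ValueError("need at least one bit")
    # Map 1 -> +1 and 0 -> -1, then sum; a balanced sequence sums near zero.
    s = sum(1 if b else -1 for b in bits)
    s_obs = abs(s) / math.sqrt(n)
    return math.erfc(s_obs / math.sqrt(2))


balanced = [0, 1] * 500   # perfectly balanced (though obviously not random)
stuck_at_one = [1] * 1000  # a generator whose output is stuck at 1

p_balanced = monobit_p_value(balanced)
p_stuck = monobit_p_value(stuck_at_one)
```

A sequence is conventionally rejected by this test when the p-value falls below 0.01. The balanced example also shows why a single test is not sufficient: an alternating pattern is perfectly balanced yet fully predictable, which is why the suite combines several tests.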


Citations: 8

IBM’s Qiskit Tool Chain: Working with and Developing for Real Quantum Computers

2019 Design, Automation & Test in Europe Conference & Exhibition (DATE)

Pub Date : 2019-03-25 DOI: 10.23919/DATE.2019.8715261

R. Wille, R. V. Meter, Y. Naveh

Quantum computers promise substantial speedups over conventional machines for many practical applications. While they were considered "dreams of the future" for a long time, the first quantum computers are now available and can be used by anyone. A leading force within this development is IBM Research, which launched the IBM Q Experience – the first industrial initiative to build universal quantum computers and make them accessible to a broad audience through cloud access. Along with this initiative, the tool Qiskit has been launched, which enables researchers, teachers, developers, and general enthusiasts to write corresponding code and to run experiments on those machines. At the same time, this provides an ideal playground for the design automation community which – through Qiskit – can deploy improved solutions, e.g., for designing and realizing quantum applications. This special session summary aims to provide an introduction to Qiskit and showcases selected success stories on how to work with and develop for it. In addition, it provides references to further reading in the form of tutorials and scientific papers, as well as links to publicly available implementations of Qiskit extensions.


Citations: 65

Cache-Aware Kernel Tiling: An Approach for System-Level Performance Optimization of GPU-Based Applications

2019 Design, Automation & Test in Europe Conference & Exhibition (DATE)

Pub Date : 2019-03-25 DOI: 10.23919/DATE.2019.8714861

Arian Maghazeh, Sudipta Chattopadhyay, P. Eles, Zebo Peng

We present a software approach to address the data latency issue for certain GPU applications. Each application is modeled as a kernel graph, where the nodes represent individual GPU kernels and the edges capture data dependencies. Our technique exploits the GPU L2 cache to accelerate parameter passing between the kernels. The key idea is that, instead of having each kernel process the entire input in one invocation, we subdivide the input into fragments (which fit in the cache) and, ideally, process each fragment in one continuous sequence of kernel invocations. Our proposed technique is oblivious to kernel functionalities and requires minimal source code modification. We demonstrate our technique on a full-fledged image processing application and improve the performance on average by 30% over various settings.
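The key idea of fragment-wise execution can be sketched on the host side in plain Python. This is only an illustration of the scheduling order, assuming purely element-wise kernels so that fragments are independent; the kernel functions, the `KERNELS` chain and `fragment_size` are hypothetical, and no GPU or cache is modelled.

```python
def scale(xs):
    """Hypothetical kernel 1: multiply every element by 2."""
    return [2 * x for x in xs]


def offset(xs):
    """Hypothetical kernel 2: add 1 to every element."""
    return [x + 1 for x in xs]


def clamp(xs):
    """Hypothetical kernel 3: cap every element at 10."""
    return [min(x, 10) for x in xs]


# A linear kernel graph: each kernel consumes the previous kernel's output.
KERNELS = [scale, offset, clamp]


def run_untiled(data):
    """Each kernel processes the entire input in one invocation."""
    for kernel in KERNELS:
        data = kernel(data)
    return data


def run_tiled(data, fragment_size):
    """Each fragment passes through the whole kernel sequence before the next."""
    out = []
    for start in range(0, len(data), fragment_size):
        fragment = data[start:start + fragment_size]
        for kernel in KERNELS:
            fragment = kernel(fragment)
        out.extend(fragment)
    return out


data = list(range(8))
tiled = run_tiled(data, fragment_size=3)
untiled = run_untiled(data)
```

Both orders produce the same result; the difference is that in the tiled order the intermediate data handed from one kernel to the next is only one fragment in size, which is what would allow it to stay resident in a cache. Kernels with neighbour dependencies, such as stencils, would need overlapping fragments, which this sketch does not handle.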


Citations: 4

Characterizing the Reliability and Threshold Voltage Shifting of 3D Charge Trap NAND Flash

2019 Design, Automation & Test in Europe Conference & Exhibition (DATE)

Pub Date : 2019-03-25 DOI: 10.23919/DATE.2019.8714941

Weihua Liu, Fei Wu, Meng Zhang, Yifei Wang, Zhonghai Lu, Xiangfeng Lu, C. Xie

3D charge trap (CT) triple-level cell (TLC) NAND flash is gradually becoming a mainstream storage component due to its high storage capacity and performance, but it introduces reliability concerns. Fault tolerance and data management schemes are capable of improving reliability. Designing a more efficient solution, however, requires understanding the reliability characteristics of 3D CT TLC NAND flash. To facilitate such understanding, by exploiting a real-world testing platform, we investigate the reliability characteristics, including the raw bit error rate (RBER) and the threshold voltage (Vth) shifting features, after suffering from variable disturbances. We analyze why these characteristics exist in 3D CT TLC NAND flash. We hope these observations can guide designers in proposing highly efficient solutions to the reliability problem.


Citations: 26

Visual Inertial Odometry At the Edge: A Hardware-Software Co-design Approach for Ultra-low Latency and Power

2019 Design, Automation & Test in Europe Conference & Exhibition (DATE)

Pub Date : 2019-03-25 DOI: 10.23919/DATE.2019.8714921

D. Mandal, S. Jandhyala, O. J. Omer, G. Kalsi, Biji George, G. Neela, S. Rethinagiri, S. Subramoney, Lance Hacking, J. Radford, E. Jones, B. Kuttanna, Hong Wang

Visual Inertial Odometry (VIO) is used for estimating the pose and trajectory of a system and is a foundational requirement in many emerging applications like AR/VR and autonomous navigation in cars, drones and robots. In this paper, we analyze key compute bottlenecks in VIO and present a highly optimized VIO accelerator based on a hardware-software codesign approach. We detail a set of novel micro-architectural techniques that optimize compute, data movement, bandwidth and dynamic power to make it possible to deliver high-quality VIO at the ultra-low latency and power required for budget-constrained edge devices. By offloading the computation of the critical linear algebra algorithms from the CPU, the accelerator enables high-sample-rate IMU usage in VIO processing, while acceleration of the image processing pipe increases precision and robustness and reduces IMU-induced drift in the final pose estimate. The proposed accelerator requires a small silicon footprint (1.3 mm² in a 28 nm process at 600 MHz), utilizes a modest on-chip shared SRAM (560 KB) and achieves a 10x speedup over a software-only implementation in terms of image sample-based pose update latency while consuming just 2.2 mW of power. In an FPGA implementation, using the EuRoC VIO dataset (VGA 30 fps images and 100 Hz IMU), the accelerator design achieves pose estimation accuracy (loop closure error) comparable to a software-based VIO implementation.


Citations: 11

The Case for Exploiting Underutilized Resources in Heterogeneous Mobile Architectures

2019 Design, Automation & Test in Europe Conference & Exhibition (DATE)

Pub Date : 2019-03-25 DOI: 10.23919/DATE.2019.8714970

Chen-Ying Hsieh, A. A. Sani, N. Dutt

Heterogeneous architectures are ubiquitous in mobile platforms, with mobile SoCs typically integrating multiple processors along with accelerators such as GPUs (for data-parallel kernels) and DSPs (for signal processing kernels). This strict partitioning of application execution on heterogeneous compute resources often results in underutilization of resources such as DSPs. We present a case study executing a mix of popular data-parallel workloads such as convolutional neural networks (CNNs), computer vision filters and graphics rendering kernels on mobile devices, and show that both performance and energy consumption of mobile platforms can be improved by synergistically deploying these underutilized compute resources. Our experiments on a mobile Snapdragon 835 platform, under both single and multiple application scenarios executing the aforementioned workloads, demonstrate average performance and energy improvements of 15-46% and 18-80%, respectively, by synergistically deploying all available compute resources, especially the underutilized DSP.


Citations: 6

LoSCache: Leveraging Locality Similarity to Build Energy-Efficient GPU L2 Cache

2019 Design, Automation & Test in Europe Conference & Exhibition (DATE)

Pub Date : 2019-03-25 DOI: 10.23919/DATE.2019.8714911

Jingweijia Tan, Kaige Yan, S. Song, Xin Fu

This paper presents a novel energy-efficient cache design for massively parallel, throughput-oriented architectures like GPUs. Unlike the L1 data cache on modern GPUs, the L2 cache shared by all the streaming multiprocessors is not the primary performance bottleneck, but it does consume a large amount of chip energy. We observe that the L2 cache is significantly under-utilized, spending 95.6% of the time storing useless data. If such "dead time" on L2 is identified and reduced, L2’s energy efficiency can be drastically improved. Fortunately, we discover that the SIMT programming model of GPUs provides a unique feature among threads: instruction-level data locality similarity, which can be used to accurately predict the data re-reference counts at L2 cache block level. We propose a simple design that leverages this Locality Similarity to build an energy-efficient GPU L2 Cache, named LoSCache. Specifically, LoSCache uses the data locality information from a small group of CTAs to dynamically predict the L2-level data re-reference counts of the remaining CTAs. After that, specific L2 cache lines can be powered off if they are predicted to be "dead" after certain accesses. Experimental results on a wide range of applications demonstrate that our proposed design can significantly reduce the L2 cache energy by an average of 64% with only 0.5% performance loss.
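The prediction idea can be illustrated with a toy software model. This is a sketch under strong assumptions, not the hardware mechanism of the paper: each CTA is represented as a list of hypothetical `(pc, line)` accesses, the predictor keys on the instruction (`pc`) that first touches a cache line, and a line is marked powered off once it has received the predicted number of re-references.

```python
import math
from collections import defaultdict


def rereference_counts(trace):
    """Map each line to (pc that first touched it, number of later accesses)."""
    first_pc = {}
    later = defaultdict(int)
    for pc, line in trace:
        if line in first_pc:
            later[line] += 1
        else:
            first_pc[line] = pc
    return {line: (pc, later[line]) for line, pc in first_pc.items()}


def train_predictor(sample_traces):
    """Per-pc prediction: the largest re-reference count seen in sampled CTAs."""
    predicted = {}
    for trace in sample_traces:
        for pc, count in rereference_counts(trace).values():
            predicted[pc] = max(predicted.get(pc, 0), count)
    return predicted


def simulate(trace, predicted):
    """Return (lines powered off, accesses that hit a powered-off line)."""
    remaining = {}
    powered_off = set()
    mispredicted = 0
    for pc, line in trace:
        if line in powered_off:
            mispredicted += 1  # the line was declared dead too early
        elif line in remaining:
            remaining[line] -= 1
            if remaining[line] == 0:
                powered_off.add(line)
        else:
            # First touch: an unknown pc keeps the line powered on forever.
            remaining[line] = predicted.get(pc, math.inf)
            if remaining[line] == 0:
                powered_off.add(line)
    return len(powered_off), mispredicted


# Sampled CTA: pc 16 loads line 0 (re-read once by pc 20), pc 32 streams line 1.
sample_cta = [(16, 0), (32, 1), (20, 0)]
# A later CTA runs the same instructions on different lines.
other_cta = [(16, 8), (32, 9), (20, 8)]

predictor = train_predictor([sample_cta])
result = simulate(other_cta, predictor)
```

In this example the later CTA behaves like the sampled one, so both of its lines are powered off with no access landing on a dead line. When CTAs diverge, the second number in the result counts the accesses that would have paid a penalty.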


Citations: 2
