
2015 International Conference on Parallel Architecture and Compilation (PACT): Latest Publications

Using Hybrid Schedules to Safely Outperform Classical Polyhedral Schedules
Ti Jin
The polyhedral model is a mathematical framework for programs with affine control loops that enables complex program transformations, such as loop permutation and loop tiling, to achieve parallelism, data locality, and energy efficiency. Polyhedral schedules are widely used by popular polyhedral compilers such as AlphaZ and PLuTo to represent program execution orders. These compilers use barriers to enforce the correct order of execution, and synchronizations usually happen more often than necessary. Current research reveals the merit of combining classical polyhedral schedules with partially ordered schedules written by hand using highly target-dependent point-wise synchronization mechanisms. However, derivation of a hybrid schedule is tedious and error-prone due to the possibility of deadlocks. Its deviation from any existing standard representation makes program verification the sole responsibility of the programmer. We propose techniques to automate the derivation, verification, and code generation of hybrid schedules. We also demonstrate the convenience and utility of such techniques in resolving the complications associated with current hybrid schedules.
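Whatever form a schedule takes (classical, hand-written, or hybrid), it is legal only if every dependence source executes before its sink, and automating the verification step amounts to checking exactly that. A minimal sketch of such a legality check, with illustrative iteration and schedule representations that are not the paper's own, might look like:

```python
# Hedged sketch: a schedule is valid iff every dependence source runs
# strictly before its sink. Iterations are (t, i) tuples; a schedule maps
# an iteration to a (comparable) logical time. All names are invented.

def schedule_is_legal(deps, schedule):
    """deps: iterable of (src_iter, dst_iter); schedule maps iter -> time."""
    return all(schedule(src) < schedule(dst) for src, dst in deps)

# Dependences of a 2-point recurrence A[t][i] = f(A[t-1][i], A[t-1][i-1]).
deps = [((t - 1, i), (t, i)) for t in range(1, 4) for i in range(4)] + \
       [((t - 1, i - 1), (t, i)) for t in range(1, 4) for i in range(1, 4)]

seq = lambda it: (it[0], it[1])       # lexicographic order: legal
skewed = lambda it: (it[0] + it[1],)  # wavefront schedule: also legal
bad = lambda it: (-it[0],)            # time reversed: illegal

print(schedule_is_legal(deps, seq))     # True
print(schedule_is_legal(deps, skewed))  # True
print(schedule_is_legal(deps, bad))     # False
```

Real polyhedral tools perform this check symbolically over parametric iteration domains rather than by enumeration; the sketch only conveys the invariant being verified.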
DOI: 10.1109/PACT.2015.52 · Published: 2015-10-18
Citations: 0
Integrating 3D Resistive Memory Cache into GPGPU for Energy-Efficient Data Processing
Jie Zhang, D. Donofrio, J. Shalf, Myoungsoo Jung
General-purpose graphics processing units (GPUs) have become a promising solution for processing massive data by taking advantage of multithreading. Thanks to thread-level parallelism, GPU-accelerated applications improve overall system performance by up to 40 times compared to a CPU-only architecture. However, data-intensive GPU applications often generate large amounts of irregular data accesses, which result in cache thrashing and contention problems. Cache thrashing in turn can introduce a large number of off-chip memory accesses, which not only wastes tremendous energy moving data between the on-chip cache and off-chip global memory, but also significantly limits system performance due to many stalled load/store instructions. In this work, we redesign the shared last-level cache (LLC) of GPU devices by introducing non-volatile memory (NVM), which can address the cache thrashing issues with low energy consumption. Specifically, we investigate two architectural approaches: one employs a 2D planar resistive random-access memory (RRAM) as our baseline NVM-cache, and the other a 3D-stacked RRAM technology. Our baseline NVM-cache replaces the SRAM-based L2 cache with RRAM of similar area; a memory die consists of eight subarrays, each a small memristor island constructed as a 512x512 matrix. Since the feature size of an SRAM cell is around 125 F2 (while that of RRAM is around 4 F2), it can offer around 30x more storage capacity than the SRAM-based cache. To make our baseline NVM-cache denser, we propose a 3D-stacked NVM-cache, which piles up four memory layers, each with a single pre-decode logic.
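The ~30x capacity claim follows directly from the cell areas the abstract quotes, and is easy to sanity-check:

```python
# Back-of-envelope check of the density claim above: an SRAM cell takes
# ~125 F^2 while an RRAM cell takes ~4 F^2, so the same die area holds
# roughly 125/4 = 31.25x more RRAM bits, which the paper rounds to ~30x.

SRAM_CELL_F2 = 125  # approximate SRAM cell area in units of F^2
RRAM_CELL_F2 = 4    # approximate RRAM cell area in units of F^2

def capacity_ratio(old_cell_area, new_cell_area):
    """How many new cells fit in the area of one old cell."""
    return old_cell_area / new_cell_area

ratio = capacity_ratio(SRAM_CELL_F2, RRAM_CELL_F2)
print(round(ratio, 2))  # 31.25
```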
DOI: 10.1109/PACT.2015.60 · Published: 2015-10-18
Citations: 1
Exploiting Program Semantics to Place Data in Hybrid Memory
Wei Wei, D. Jiang, S. Mckee, Jin Xiong, Mingyu Chen
Large-memory applications like data analytics and graph processing benefit from extended memory hierarchies, and hybrid DRAM/NVM (non-volatile memory) systems represent an attractive means by which to increase capacity at reasonable performance/energy tradeoffs. Compared to DRAM, NVMs generally have longer latencies and higher energies for writes, which makes careful data placement essential for efficient system operation. Data placement strategies that resort to monitoring all data accesses and migrating objects to dynamically adjust data locations incur high monitoring overhead and unnecessary memory copies due to mispredicted migrations. We find that program semantics (specifically, global access characteristics) can effectively guide initial data placement with respect to memory types, which, in turn, makes run-time migration more efficient. We study a combined offline/online placement scheme that uses access profiling information to place objects statically and then selectively monitors run-time behaviors to optimize placements dynamically. We present a software/hardware cooperative framework, 2PP, and evaluate it with respect to state-of-the-art migratory placement, finding that it improves performance by an average of 12.1%. Furthermore, 2PP improves energy efficiency by up to 51.8%, and by an average of 18.4%. It does so by reducing run-time monitoring and migration overheads.
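The offline half of such a scheme can be pictured as a classifier over profiled access counts: write-heavy objects go to DRAM (cheap writes), read-mostly objects to NVM (larger capacity, expensive writes). The sketch below is purely illustrative; the object names, threshold, and profile format are invented, not 2PP's actual representation:

```python
# Illustrative sketch of profile-guided initial placement in a hybrid
# DRAM/NVM system. NVM writes are costly, so objects whose profiled
# write ratio exceeds a threshold are steered to DRAM; the rest go to
# NVM. The threshold and profile shape are assumptions for this sketch.

def initial_placement(profile, write_ratio_threshold=0.2):
    """profile: {obj_name: (reads, writes)} -> {obj_name: 'DRAM' | 'NVM'}"""
    placement = {}
    for obj, (reads, writes) in profile.items():
        total = reads + writes
        write_ratio = writes / total if total else 0.0
        placement[obj] = 'DRAM' if write_ratio > write_ratio_threshold else 'NVM'
    return placement

# A frequently updated hash table vs. a read-mostly input graph:
profile = {'hash_table': (1000, 900), 'input_graph': (5000, 10)}
print(initial_placement(profile))
# {'hash_table': 'DRAM', 'input_graph': 'NVM'}
```

The online half would then monitor only objects whose behavior drifts from the profile, which is what keeps the run-time migration overhead low.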
DOI: 10.1109/PACT.2015.10 · Published: 2015-10-18
Citations: 44
PENCIL: A Platform-Neutral Compute Intermediate Language for Accelerator Programming
Riyadh Baghdadi, Ulysse Beaugnon, Albert Cohen, T. Grosser, Michael Kruse, Chandan Reddy, Sven Verdoolaege, A. Betts, A. Donaldson, J. Ketema, J. Absar, S. V. Haastregt, Alexey Kravets, Anton Lokhmotov, R. David, Elnar Hajiyev
Programming accelerators such as GPUs with low-level APIs and languages such as OpenCL and CUDA is difficult, error-prone, and not performance-portable. Automatic parallelization and domain-specific languages (DSLs) have been proposed to hide complexity and regain performance portability. We present PENCIL, a rigorously defined subset of GNU C99, enriched with additional language constructs, that enables compilers to exploit parallelism and produce highly optimized code when targeting accelerators. PENCIL aims to serve both as a portable implementation language for libraries and as a target language for DSL compilers. We implemented a PENCIL-to-OpenCL backend using a state-of-the-art polyhedral compiler. The polyhedral compiler, extended to handle data-dependent control flow and non-affine array accesses, generates optimized OpenCL code.
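The distinction the extended compiler must handle is between accesses like A[2*i + 3], which are affine in the loop index and thus analyzable statically, and indirect accesses like A[idx[i]], which are not. A toy way to see the difference (real polyhedral tools prove affinity statically from the source, not from traces):

```python
# Illustrative only: an affine 1-D access A[a*i + b] visits indices with
# a constant first difference, while an indirect access A[idx[i]] does
# not. This dynamic check on an index trace is just a way to visualize
# the property PENCIL's polyhedral backend must establish statically.

def is_affine_trace(indices):
    """True if consecutive index differences are all equal (constant stride)."""
    diffs = {b - a for a, b in zip(indices, indices[1:])}
    return len(diffs) <= 1

affine = [2 * i + 3 for i in range(8)]  # A[2*i + 3]
indirect = [0, 4, 1, 7, 2]              # e.g. values loaded from an index array

print(is_affine_trace(affine))    # True
print(is_affine_trace(indirect))  # False
```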
DOI: 10.1109/PACT.2015.17 · Published: 2015-10-18
Citations: 100
Throttling Automatic Vectorization: When Less is More
Vasileios Porpodas, Timothy M. Jones
SIMD vectors are widely adopted in modern general-purpose processors as they can boost performance and energy efficiency for certain applications. Compiler-based automatic vectorization is one approach for generating code that makes efficient use of the SIMD units, and has the benefit of avoiding hand development and platform-specific optimizations. The Superword-Level Parallelism (SLP) vectorization algorithm is the most well-known implementation of automatic vectorization when starting from straight-line scalar code, and is implemented in several major compilers. The existing SLP algorithm greedily packs scalar instructions into vectors starting from stores and traversing the data dependence graph upwards until it reaches loads or non-vectorizable instructions. Choosing whether to vectorize is a one-off decision for the whole graph that has been generated. This, however, is sub-optimal because the graph may contain code that is harmful to vectorization due to the need to move data from scalar registers into vectors. The decision does not consider the potential benefits of throttling the graph by removing this harmful code. In this work we propose a solution to overcome this limitation by introducing Throttled SLP (TSLP), a novel vectorization algorithm that finds the optimal graph to vectorize, forcing vectorization to stop earlier whenever this is beneficial. Our experiments show that TSLP improves performance across a number of kernels extracted from widely-used benchmark suites, decreasing execution time compared to SLP by 9% on average and up to 14% in the best case.
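The throttling decision can be caricatured with a toy cost model (the per-node savings, transfer cost, and chain abstraction below are invented for illustration, not TSLP's actual model): walk the store-to-load chain, and keep only the prefix whose accumulated vector savings outweigh the cost of moving values between scalar and vector registers at the cut point.

```python
# Hedged sketch of graph throttling under an invented cost model. SLP
# takes a one-off decision for the whole chain; a throttled scheme may
# cut early, dropping trailing nodes whose scalar-to-vector data movement
# makes them net-harmful to vectorize.

def best_cut(savings, transfer_cost):
    """savings[k]: vector saving of node k along the store-to-load chain
    (negative = harmful). Returns (best prefix length, net benefit);
    a length of 0 means vectorizing is never worthwhile."""
    best_len, best_net, running = 0, 0, 0
    for k, s in enumerate(savings, start=1):
        running += s
        net = running - transfer_cost  # pay for the scalar/vector boundary
        if net > best_net:
            best_len, best_net = k, net
    return best_len, best_net

# Nodes near the loads hurt vectorization (negative savings), so the
# best cut keeps only the first three nodes:
print(best_cut([4, 3, 2, -5, -6], transfer_cost=2))  # (3, 7)
```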
DOI: 10.1109/PACT.2015.32 · Published: 2015-10-18
Citations: 33
Towards General-Purpose Neural Network Computing
Schuyler Eldridge, Amos Waterland, M. Seltzer, J. Appavoo, A. Joshi
Machine learning is becoming pervasive; decades of research in neural network computation are now being leveraged to learn patterns in data and perform computations that are difficult to express using standard programming approaches. Recent work has demonstrated that custom hardware accelerators for neural network processing can outperform software implementations in both performance and power consumption. However, there is neither an agreed-upon interface to neural network accelerators nor a consensus on neural network hardware implementations. We present a generic set of software/hardware extensions, X-FILES, that allow for the general-purpose integration of feedforward and feedback neural network computation in applications. The interface is independent of the network type, configuration, and implementation. Using these proposed extensions, we demonstrate and evaluate an example dynamically allocated, multi-context neural network accelerator architecture, DANA.
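For concreteness, the feedforward computation that an accelerator like DANA would offload is just repeated matrix-vector products followed by a nonlinearity. The weights, layer shapes, and activation choice below are toy values for illustration; the accelerator interface itself is out of scope for this sketch.

```python
# Minimal feedforward inference in pure Python (toy weights, tanh
# activation assumed for illustration). An accelerator interface would
# ship the layers once and stream (input, output) pairs per request.
import math

def feedforward(x, layers):
    """layers: list of (weight_matrix, bias_vector) pairs."""
    for W, b in layers:
        x = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + bi)
             for row, bi in zip(W, b)]
    return x

layers = [([[0.5, -0.5], [0.25, 0.75]], [0.0, 0.1]),  # 2 -> 2
          ([[1.0, 1.0]], [0.0])]                      # 2 -> 1
out = feedforward([1.0, 2.0], layers)
print(len(out))  # 1
```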
DOI: 10.1109/PACT.2015.21 · Published: 2015-10-18
Citations: 22
Runtime Value Numbering: A Profiling Technique to Pinpoint Redundant Computations
Shasha Wen, Xu Liu, Milind Chabbi
Redundant computations can severely degrade performance in HPC applications. Redundant computations arise due to various causes such as developers' inattention to performance, inappropriate choice of algorithms, and inefficient code generation, among others. Aliasing, limited optimization scopes, and insensitivity to input and execution contexts act as severe deterrents to static program analysis. Furthermore, static analysis cannot quantify the benefit from redundancy elimination. Consequently, large optimization efforts may yield little or no benefit. To address these limitations, we develop a dynamic profiler to pinpoint and quantify redundant computations in an execution. Our methodology -- Runtime Value Numbering (RVN) -- is based on the classical value numbering technique but works at runtime instead of compile time. RVN works on unmodified, fully-optimized binaries. RVN provides insightful feedback about redundancies and helps developers tune their applications for high performance. Since RVN employs fine-grained instrumentation, it incurs high overhead. We apply several optimizations to reduce the profiling overhead. Guided by the feedback from RVN, we optimize four benchmarks from the SPEC CPU2000/2006 suites, Sweep3D, and NAS Multi Grid (MG). We speed up these programs by up to 1.22x. RVN identifies computation redundancies that compilers failed to optimize even with profile-guided optimization.
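The core of value numbering at runtime can be sketched in a few lines: hash each operation on the actual values of its operands, and a hit means the instruction recomputed a result that was already available. RVN does this on unmodified binaries via fine-grained instrumentation; the trace format below is an invented stand-in.

```python
# Hedged sketch of runtime value numbering over a toy instruction trace.
# Each trace entry is (opcode, operand_values, result); an operation
# whose (opcode, operand_values) key was already seen with the same
# result is a redundant computation.

def count_redundant(trace):
    seen, redundant = {}, 0
    for op, operands, result in trace:
        key = (op, operands)
        if key in seen and seen[key] == result:
            redundant += 1
        seen[key] = result
    return redundant

trace = [('add', (1, 2), 3),
         ('mul', (3, 3), 9),
         ('add', (1, 2), 3),   # recomputes an already-available value
         ('add', (1, 3), 4)]
print(count_redundant(trace))  # 1
```

Because the keys are operand *values* rather than symbolic expressions, this catches redundancies that aliasing and limited scopes hide from a static compiler, which is exactly the gap the paper targets.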
DOI: 10.1109/PACT.2015.29 · Published: 2015-10-18
Citations: 16
OSPREY: Implementation of Memory Consistency Models for Cache Coherence Protocols involving Invalidation-Free Data Access
George Kurian, Qingchuan Shi, S. Devadas, O. Khan
Data access in modern processors contributes significantly to the overall performance and energy consumption. Traditionally, data is distributed among the cores through an on-chip cache hierarchy, and each producer/consumer accesses data through its private level-1 cache, relying on the cache coherence protocol for consistency. Recently, remote access, a mechanism that reduces energy and latency through word-level access to data anywhere on chip, has been proposed. Remote access does not replicate data in the private caches, and thereby removes the need for expensive cache line invalidations or updates. Researchers have implemented remote access as an auxiliary mechanism in cache coherence to improve efficiency. Unfortunately, stronger memory models, such as Intel's TSO, require strict ordering among loads and stores. This introduces serialization penalties for data classified to be accessed remotely, which hampers each core's ability to optimally exploit memory-level parallelism. In this paper we propose a novel timestamp-based scheme to detect memory consistency violations. The proposed scheme enables remote accesses to be issued and completed in parallel while continuously detecting whether any ordering violations have occurred, and rolling back the pipeline state (if needed). We implement our scheme for the locality-aware cache coherence protocol that uses remote access as an auxiliary mechanism for efficient data access.
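The detection idea can be caricatured as follows (a deliberately simplified software analogue; the real scheme is per-cache-line hardware with pipeline rollback, and TSO permits some reorderings this toy model does not distinguish): issue memory operations speculatively in parallel, stamp each with a completion time, and flag any pair of program-ordered operations that completed out of that order as a candidate violation requiring rollback.

```python
# Hedged sketch of timestamp-based ordering-violation detection. Each
# entry is (program_index, completion_timestamp), listed in program
# order; a later op finishing before an earlier one is flagged.

def find_violations(ops):
    violations = []
    for (i, ti), (j, tj) in zip(ops, ops[1:]):
        if tj < ti:  # op j (later in program order) completed first
            violations.append((i, j))
    return violations

# Three ops issued in parallel; op 2 completed before op 1:
ops = [(0, 10), (1, 12), (2, 11)]
print(find_violations(ops))  # [(1, 2)]
```

The benefit the paper measures comes from the common case where no violation is flagged: the parallel issue completes without any serialization penalty.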
{"title":"OSPREY: Implementation of Memory Consistency Models for Cache Coherence Protocols involving Invalidation-Free Data Access","authors":"George Kurian, Qingchuan Shi, S. Devadas, O. Khan","doi":"10.1109/PACT.2015.45","DOIUrl":"https://doi.org/10.1109/PACT.2015.45","url":null,"abstract":"Data access in modern processors contributes significantly to the overall performance and energy consumption. Traditionally, data is distributed among the cores through an on-chip cache hierarchy, and each producer/consumer accesses data through its private level-1 cache relying on the cache coherence protocol for consistency. Recently, remote access, a mechanism that reduces energy and latency through word-level access to data anywhere on chip has been proposed. Remote access does not replicate data in the private caches, and thereby removes the need for expensive cache line invalidations or updates. Researchers have implemented remote access as an auxiliary mechanism in cache coherence to improve efficiency. Unfortunately, stronger memory models, such as Intel's TSO, require strict ordering among the loads and stores. This introduces serialization penalties for data classified to be accessed remotely, which hampers each core's ability to optimally exploit memory level parallelism. In this paper we propose a novel timestamp-based scheme to detect memory consistency violations. The proposed scheme enables remote accesses to be issued and completed in parallel while continuously detecting whether any ordering violations have occurred, and rolling back the pipeline state (if needed). We implement our scheme for the locality-aware cache coherence protocol that uses remote access as an auxiliary mechanism for efficient data access. 
Our evaluation using a 64-core multicore processor with out-of-order speculative cores shows that the proposed technique improves completion time by 26% and energy by 20% over a state-of-the-art cache management scheme.","PeriodicalId":385398,"journal":{"name":"2015 International Conference on Parallel Architecture and Compilation (PACT)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126659383","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
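The detection idea in the abstract — issue remote accesses in parallel, watch for ordering violations, and squash the pipeline when one appears — can be sketched as a toy model. This is not the paper's hardware scheme; the `Core` class, the timestamp convention, and the squash policy are all illustrative assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class Core:
    """Toy model of a core whose remote loads complete speculatively.

    Each in-flight load records (address, issue_timestamp). If a remote
    store that is globally ordered *before* a still-speculative load to
    the same address arrives, that load returned stale data under TSO:
    the core rolls back to the oldest offending load.
    """
    speculative_loads: list = field(default_factory=list)  # (addr, ts)
    rollbacks: int = 0

    def issue_load(self, addr, ts):
        self.speculative_loads.append((addr, ts))

    def retire_up_to(self, ts):
        # Loads at or below ts are now architecturally committed.
        self.speculative_loads = [(a, t) for a, t in self.speculative_loads
                                  if t > ts]

    def observe_remote_store(self, addr, store_ts):
        # Conflict: a store older than a speculative load to the same address.
        offending = [t for a, t in self.speculative_loads
                     if a == addr and t > store_ts]
        if not offending:
            return None  # no ordering violation; keep running in parallel
        self.rollbacks += 1
        squash_from = min(offending)
        # Squash the offending load and everything younger than it.
        self.speculative_loads = [(a, t) for a, t in self.speculative_loads
                                  if t < squash_from]
        return squash_from  # pipeline would restart from this point
```

In this sketch the common case pays nothing extra: loads complete in parallel and `observe_remote_store` returns `None`; only a genuine conflict triggers a rollback, which mirrors why the scheme avoids the blanket serialization penalty the abstract describes.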
Cited: 4
DVFS-Aware Consolidation for Energy-Efficient Clouds
Patricia Arroba, Jose M. Moya, J. Ayala, R. Buyya
Nowadays, data centers consume about 2% of the world's energy production, emitting more than 43 million tons of CO2 per year. Cloud providers need to implement energy-efficient management of physical resources in order to meet the growing demand for their services and ensure minimal costs. From the application-framework viewpoint, Cloud workloads present additional restrictions, such as 24/7 availability and SLA constraints, among others. Workload variation also impacts the performance of two of the main strategies for energy efficiency in Cloud data centers: Dynamic Voltage and Frequency Scaling (DVFS) and consolidation. Our work proposes two contributions: 1) a DVFS policy that takes into account the trade-offs between energy consumption and performance degradation; 2) a novel consolidation algorithm that is aware of the frequency that would be necessary when allocating a Cloud workload in order to maintain QoS. Our results demonstrate that including DVFS awareness in workload management provides substantial energy savings of up to 39.14% for scenarios under dynamic workload conditions.
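The second contribution — consolidation that accounts for the frequency a host would need after accepting a workload — can be sketched as follows. This is a minimal illustration, not the paper's algorithm: the host records, the capacity units, and the `f**3` power proxy are all assumptions.

```python
def pick_frequency(levels, demand):
    """Lowest frequency level (GHz, used here as a capacity proxy) that
    still meets the given demand, or None if the host cannot satisfy it."""
    for f in sorted(levels):
        if f >= demand:
            return f
    return None

def dvfs_aware_place(hosts, vm_load):
    """Place vm_load on the host minimizing a dynamic-power proxy.

    Dynamic CMOS power grows roughly with f * V^2, and V scales with f,
    so f**3 serves as a crude proxy. A host is only eligible if some
    frequency level covers its load plus the new workload (the QoS check).
    """
    best = None  # (host, frequency, power proxy)
    for host in hosts:
        f = pick_frequency(host["levels"], host["load"] + vm_load)
        if f is None:
            continue  # no frequency level satisfies QoS on this host
        power = f ** 3
        if best is None or power < best[2]:
            best = (host, f, power)
    if best is None:
        return None  # workload cannot be placed anywhere
    host, f, _ = best
    host["load"] += vm_load  # consolidate the workload onto the winner
    return host["name"], f
```

A frequency-oblivious consolidator would pack onto the busiest feasible host; here a lightly loaded host can win instead, because accepting the workload there avoids stepping up to a costlier frequency level — the trade-off the DVFS policy in contribution (1) is about.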
DOI: 10.1109/PACT.2015.59
Cited: 39
Cosmology and Computers: HACCing the Universe
S. Habib
Summary form only given. Deep and wide surveys of the sky have led to a remarkable set of discoveries in cosmology. As the survey volumes become so large that statistical uncertainties almost disappear, cosmological modeling must reach unprecedented levels of scale and accuracy to properly interpret observational results. I will describe the key scientific problems and issues involved and then present the HACC (Hardware/Hybrid Accelerated Cosmology Code) framework, designed around a portable particle-based simulation model for the required, very high dynamic range applications. I will briefly cover the key features of HACC and plans for its future development, focusing on computational, algorithmic, and physics advances, in-situ analysis, and resilience features, while emphasizing the associated computer science needs.
DOI: 10.1109/PACT.2015.50
Cited: 3
2015 International Conference on Parallel Architecture and Compilation (PACT)