Efficient Timestamp-Based Cache Coherence Protocol for Many-Core Architectures (DOI: 10.1145/2925426.2926270)
Yuan Yao, Guanhua Wang, Zhiguo Ge, T. Mitra, Wenzhi Chen, Naxin Zhang

As we enter the many-core era, providing the shared-memory abstraction through cache coherence has become progressively more difficult. The de-facto standard, directory-based cache coherence, has been studied extensively, but it does not scale well with increasing core count. Recently introduced timestamp-based hardware coherence protocols offer an attractive alternative. In this paper, we propose a timestamp-based coherence protocol, called TC-Release++, that addresses the scalability issues of efficiently supporting cache coherence in large-scale systems. Our approach is inspired by TC-Weak, a recently proposed timestamp-based coherence protocol targeting GPU architectures. We first design TC-Release coherence in an attempt to port TC-Weak straightforwardly to general-purpose many-cores. However, re-purposing TC-Weak for general-purpose many-core architectures is challenging due to significant differences in both the architecture and the programming model; indeed, TC-Release turns out to perform worse than conventional directory coherence protocols. We overcome the limitations and overheads of TC-Release by introducing simple hardware support that eliminates frequent memory stalls, and an optimized lifetime-prediction mechanism that improves cache performance. The resulting optimized coherence protocol, TC-Release++, is highly scalable (the coherence overhead per last-level cache line scales logarithmically with core count, as opposed to linearly for directory coherence) and shows better execution time (by 3.0%) and comparable network traffic (within 1.3%) relative to the baseline MESI directory coherence protocol.
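To make the timestamp idea concrete, the sketch below models the basic lease-based self-invalidation that TC-Weak-style protocols build on: each cached line carries a predicted expiry timestamp, and a read is serviced locally only while that lease is live. This is a minimal illustration of the general mechanism, not the TC-Release++ protocol itself; the flat global counter, the field names, and the lifetime value are assumptions.

```cuda
// Minimal sketch of timestamp (lease) based self-invalidation, the idea
// underlying TC-Weak/TC-Release-style protocols. The flat global counter and
// all names are illustrative assumptions, not the paper's design.
#include <cstdint>
#include <cstdio>

struct CacheLine {
    uint64_t tag   = 0;
    uint64_t lease = 0;        // predicted "valid until" timestamp
    bool     valid = false;
    uint32_t data  = 0;
};

uint64_t global_time = 0;      // logically shared, monotonically increasing

// A read hit is usable only while the lease has not expired; otherwise the
// line self-invalidates and must be re-fetched from the shared level,
// removing the need for a directory to track sharers.
bool read_hit(CacheLine& line, uint64_t tag) {
    if (line.valid && line.tag == tag && global_time <= line.lease)
        return true;           // timestamp still live: safe to use locally
    line.valid = false;        // expired: self-invalidate, go to the shared level
    return false;
}

// On a refill, the shared level grants a lease predicted from the line's
// expected lifetime; longer leases mean fewer refills but delayed visibility
// of remote writes, which must account for outstanding leases.
void refill(CacheLine& line, uint64_t tag, uint32_t value, uint64_t predicted_lifetime) {
    line.tag   = tag;
    line.data  = value;
    line.lease = global_time + predicted_lifetime;
    line.valid = true;
}

int main() {
    CacheLine l;
    refill(l, 0x40, 7, /*predicted_lifetime=*/100);
    global_time = 50;
    printf("hit at t=50: %d\n", read_hit(l, 0x40));   // 1: lease still valid
    global_time = 200;
    printf("hit at t=200: %d\n", read_hit(l, 0x40));  // 0: self-invalidated
    return 0;
}
```

The protocol-level contributions of the paper sit on top of this mechanism: predicting lifetimes well and eliminating the memory stalls that expired leases would otherwise cause.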
{"title":"Efficient Timestamp-Based Cache Coherence Protocol for Many-Core Architectures","authors":"Yuan Yao, Guanhua Wang, Zhiguo Ge, T. Mitra, Wenzhi Chen, Naxin Zhang","doi":"10.1145/2925426.2926270","DOIUrl":"https://doi.org/10.1145/2925426.2926270","url":null,"abstract":"As we enter the era of many-core, providing the shared memory abstraction through cache coherence has become progressively difficult. The de-facto standard directory-based cache coherence has been extensively studied; but it does not scale well with increasing core count. Timestamp-based hardware coherence protocols introduced recently offer an attractive alternative solution. In this paper, we propose a timestamp-based coherence protocol, called TC-Release++, that addresses the scalability issues of efficiently supporting cache coherence in large-scale systems. Our approach is inspired by TC-Weak, a recently proposed timestamp-based coherence protocol targeting GPU architectures. We first design TC-Release coherence in an attempt to straightforwardly port TC-Weak to general-purpose many-cores. But re-purposing TC-Weak for general-purpose many-core architectures is challenging due to significant differences both in architecture and the programming model. Indeed the performance of TC-Release turns out to be worse than conventional directory coherence protocols. We overcome the limitations and overheads of TC-Release by introducing simple hardware support to eliminate frequent memory stalls, and an optimized life-time prediction mechanism to improve cache performance. The resulting optimized coherence protocol TC-Release++ is highly scalable (overhead for coherence per last-level cache line scales logarithmically with core count as opposed to linearly for directory coherence) and shows better execution time (3.0%) and comparable network traffic (within 1.3%) relative to the baseline MESI directory coherence protocol.","PeriodicalId":422112,"journal":{"name":"Proceedings of the 2016 International Conference on Supercomputing","volume":"277 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123065437","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Patrick Judd, Jorge Albericio, Tayler H. Hetherington, Tor M. Aamodt, Natalie D. Enright Jerger, Andreas Moshovos
This work exploits the tolerance of Deep Neural Networks (DNNs) to reduced-precision numerical representations and, specifically, their recently demonstrated ability to tolerate representations of different precision per layer while maintaining accuracy. This flexibility enables improvements over conventional DNN implementations that use a single, uniform representation. This work proposes Proteus, which reduces the data traffic and storage footprint needed by DNNs, resulting in reduced energy and improved area efficiency for DNN implementations. Proteus uses a different representation per layer for both the data (neurons) and the weights (synapses) processed by DNNs. Proteus is a layered extension over existing DNN implementations that converts between the numerical representation used by the DNN execution engines and the shorter, layer-specific fixed-point representation used when reading and writing data values to memory, be it on-chip buffers or off-chip memory. Proteus uses a novel memory layout for DNN data, enabling a simple, low-cost and low-energy conversion unit. We evaluate Proteus as an extension to a state-of-the-art accelerator [7] that uses a uniform 16-bit fixed-point representation. On five popular DNNs, Proteus reduces data traffic among layers by 43% on average while maintaining accuracy within 1%, even when compared to a single-precision floating-point implementation. As a result, Proteus improves energy by 15% with no performance loss. Proteus also reduces the data footprint by at least 38% and hence the amount of on-chip buffering needed, resulting in an implementation that requires 20% less area overall. These area savings can be used to reduce cost by building smaller chips, to process larger DNNs for the same on-chip area, or to incorporate an additional three execution engines, increasing peak performance bandwidth by 18%.
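The storage-side conversion can be pictured as narrowing each value from the engine's uniform 16-bit fixed point to a per-layer width on a store, and widening it back on a load. The sketch below shows that round trip; the specific bit-widths, the Q8.8 compute format, and the clamping policy are illustrative assumptions rather than Proteus's actual packing scheme.

```cuda
// Illustrative per-layer precision conversion in the spirit of Proteus:
// values are computed in a uniform 16-bit fixed-point format but stored in a
// narrower, layer-specific fixed-point format. Bit-widths and rounding policy
// here are assumptions for illustration only.
#include <cstdint>
#include <cstdio>

struct LayerFormat {
    int bits;        // storage width chosen per layer (e.g., by profiling)
    int frac_bits;   // position of the binary point for this layer
};

// Compute-side value: Q8.8, i.e. 16-bit fixed point with 8 fractional bits.
constexpr int COMPUTE_FRAC_BITS = 8;

// Narrow a 16-bit compute value to the layer's storage format (with clamping).
int32_t to_storage(int16_t v, const LayerFormat& f) {
    int32_t shifted = (int32_t)v >> (COMPUTE_FRAC_BITS - f.frac_bits);
    int32_t lo = -(1 << (f.bits - 1)), hi = (1 << (f.bits - 1)) - 1;
    return shifted < lo ? lo : (shifted > hi ? hi : shifted);
}

// Expand a stored value back to the uniform 16-bit compute format.
int16_t to_compute(int32_t s, const LayerFormat& f) {
    return (int16_t)(s << (COMPUTE_FRAC_BITS - f.frac_bits));
}

int main() {
    LayerFormat conv1{10, 6};                 // hypothetical per-layer choice
    int16_t x = (int16_t)(3.25f * (1 << COMPUTE_FRAC_BITS));
    int32_t packed = to_storage(x, conv1);    // 10 bits stored instead of 16
    float back = to_compute(packed, conv1) / float(1 << COMPUTE_FRAC_BITS);
    printf("stored in %d bits, recovered %.4f (original 3.25)\n", conv1.bits, back);
    return 0;
}
```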
{"title":"Proteus: Exploiting Numerical Precision Variability in Deep Neural Networks","authors":"Patrick Judd, Jorge Albericio, Tayler H. Hetherington, Tor M. Aamodt, Natalie D. Enright Jerger, Andreas Moshovos","doi":"10.1145/2925426.2926294","DOIUrl":"https://doi.org/10.1145/2925426.2926294","url":null,"abstract":"This work exploits the tolerance of Deep Neural Networks (DNNs) to reduced precision numerical representations and specifically, their recently demonstrated ability to tolerate representations of different precision per layer while maintaining accuracy. This flexibility enables improvements over conventional DNN implementations that use a single, uniform representation. This work proposes Proteus, which reduces the data traffic and storage footprint needed by DNNs, resulting in reduced energy and improved area efficiency for DNN implementations. Proteus uses a different representation per layer for both the data (neurons) and the weights (synapses) processed by DNNs. Proteus is a layered extension over existing DNN implementations that converts between the numerical representation used by the DNN execution engines and the shorter, layer-specific fixed-point representation used when reading and writing data values to memory be it on-chip buffers or off-chip memory. Proteus uses a novel memory layout for DNN data, enabling a simple, low-cost and low-energy conversion unit. We evaluate Proteus as an extension to a state-of-the-art accelerator [7] which uses a uniform 16-bit fixed-point representation. On five popular DNNs Proteus reduces data traffic among layers by 43% on average while maintaining accuracy within 1% even when compared to a single precision floating-point implementation. As a result, Proteus improves energy by 15% with no performance loss. Proteus also reduces the data footprint by at least 38% and hence the amount of on-chip buffering needed resulting in an implementation that requires 20% less area overall. This area savings can be used to improve cost by building smaller chips, to process larger DNNs for the same on-chip area, or to incorporate an additional three execution engines increasing peak performance bandwidth by 18%.","PeriodicalId":422112,"journal":{"name":"Proceedings of the 2016 International Conference on Supercomputing","volume":"78 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122685406","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xuehai Qian, Koushik Sen, Paul H. Hargrove, Costin Iancu
Replay of parallel execution is required by HPC debuggers and resilience mechanisms. To date, there is no deterministic replay solution for one-sided communication. The essential problem is that the readers of updated data have no information about which remote threads produced the updates, and conventional happens-before-based order-tracking techniques are difficult to scale. This paper presents SReplay, the first software tool for sub-group deterministic record and replay for one-sided communication. SReplay allows the user to specify and record the execution of a set of threads of interest (the sub-group), and then deterministically replays the execution of the sub-group on a local machine without starting the remaining threads. SReplay ensures sub-group determinism using a hybrid data- and order-replay technique. It maintains scalability through a combination of local logging and approximate event-order tracking within the sub-group. Our evaluation on deterministic and nondeterministic UPC programs shows that SReplay introduces an overhead ranging from 1.3x to 29x when running on 1,024 cores and tracking up to 16 threads.
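The data-replay half of such a scheme can be pictured as logging, at record time, every value a tracked thread observes through a one-sided GET, so that the sub-group can later be re-executed against the log with no remote threads present. The sketch below illustrates only that idea; the types and the remote_get wrapper are hypothetical, and SReplay's real mechanism additionally tracks an approximate event order within the sub-group.

```cuda
// Sketch of the data-replay half of sub-group record/replay for one-sided
// communication: during recording, every value a tracked thread obtains from
// a remote GET is appended to a local log; during replay the log is consumed
// instead of performing the communication. All structures and names here are
// hypothetical stand-ins, not SReplay's implementation.
#include <cstdint>
#include <cstring>
#include <vector>

enum class Mode { Record, Replay };

struct GetLog {
    std::vector<std::vector<uint8_t>> entries;  // one payload per remote GET
    size_t cursor = 0;
};

void remote_get(void* dst, const void* remote_src, size_t n,
                Mode mode, GetLog& log) {
    if (mode == Mode::Record) {
        // Perform the real one-sided read (placeholder: a local memcpy here),
        // then log the bytes that were observed.
        std::memcpy(dst, remote_src, n);
        log.entries.emplace_back((const uint8_t*)dst, (const uint8_t*)dst + n);
    } else {
        // Replay: feed back the recorded bytes; no remote thread is needed.
        const auto& e = log.entries[log.cursor++];
        std::memcpy(dst, e.data(), n);
    }
}

int main() {
    GetLog log;
    int shared = 42, local = 0;
    remote_get(&local, &shared, sizeof(int), Mode::Record, log);  // records 42
    shared = 0;                                                   // "remote" side gone
    local = -1;
    remote_get(&local, &shared, sizeof(int), Mode::Replay, log);  // restores 42
    return local == 42 ? 0 : 1;
}
```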
{"title":"SReplay: Deterministic Sub-Group Replay for One-Sided Communication","authors":"Xuehai Qian, Koushik Sen, Paul H. Hargrove, Costin Iancu","doi":"10.1145/2925426.2926264","DOIUrl":"https://doi.org/10.1145/2925426.2926264","url":null,"abstract":"Replay of parallel execution is required by HPC debuggers and resilience mechanisms. Up-to-date, there is no existing deterministic replay solution for one-sided communication. The essential problem is that the readers of updated data do not have any information on which remote threads produced the updates, the conventional happens-before based ordering tracking techniques are challenging to work at scale. This paper presents SReplay, the first software tool for sub-group deterministic record and replay for one-sided communication. SReplay allows the user to specify and record the execution of a set of threads of interest (sub-group), and then deterministically replays the execution of the sub-group on a local machine without starting the remaining threads. SReplay ensures sub-group determinism using a hybrid data- and order-replay technique. SReplay maintains scalability by a combination of local logging and approximative event order tracking within sub-group. Our evaluation on deterministic and nondeterministic UPC programs shows that SReplay introduces an overhead ranging from 1.3x to 29x, when running on 1,024 cores and tracking up to 16 threads.","PeriodicalId":422112,"journal":{"name":"Proceedings of the 2016 International Conference on Supercomputing","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126769754","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
K. Mahadik, Christopher Wright, Jinyi Zhang, Milind Kulkarni, S. Bagchi, S. Chaterji
Breakthroughs in gene sequencing technologies have led to an exponential increase in the amount of genomic data. Efficient tools to rapidly process such large quantities of data are critical in the study of gene functions, diseases, evolution, and population variation. These tools are designed in an ad-hoc manner and require extensive programmer effort to develop and optimize. Often, such tools are written with the currently available data sizes in mind, and soon start to underperform due to the exponential growth in data. Furthermore, to obtain high performance, these tools require parallel implementations, adding to the development complexity. This paper makes the observation that most such tools contain a recurring set of software modules, or kernels. The availability of efficient implementations of such kernels can improve programmer productivity and provide effective scalability with growing data. To achieve this goal, the paper presents a domain-specific language, called Sarvavid, which provides these kernels as language constructs. Sarvavid comes with a compiler that performs domain-specific optimizations beyond the scope of libraries and generic compilers. Furthermore, Sarvavid inherently supports exploitation of parallelism across multiple nodes. To demonstrate the efficacy of Sarvavid, we implement five well-known genomics applications---BLAST, MUMmer, E-MEM, SPAdes, and SGA---using Sarvavid. Our versions of BLAST, MUMmer, and E-MEM show speedups of 2.4X, 2.5X, and 2.1X, respectively, compared to hand-optimized implementations when run on a single node, while SPAdes and SGA match the performance of hand-written code. Moreover, Sarvavid applications scale to 1024 cores using a Hadoop backend.
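As an example of the kind of recurring kernel such a DSL can expose as a language construct, the sketch below implements k-merization (extracting and counting all length-k substrings of a sequence), a building block shared by seed-based aligners and assemblers. It is a standalone sketch for illustration only, not Sarvavid syntax, and the choice of k-merization as the example is ours rather than a list taken from the abstract.

```cuda
// A k-merization routine of the kind that recurs across genomics tools
// (seeding in aligners, de Bruijn graph construction in assemblers) and that
// a DSL like Sarvavid can expose as a single construct and then optimize.
// Plain host code for illustration; not Sarvavid syntax.
#include <string>
#include <unordered_map>
#include <cstdio>

// Count every length-k substring (k-mer) of a sequence.
std::unordered_map<std::string, int> kmerize(const std::string& seq, size_t k) {
    std::unordered_map<std::string, int> counts;
    if (seq.size() >= k)
        for (size_t i = 0; i + k <= seq.size(); ++i)
            ++counts[seq.substr(i, k)];
    return counts;
}

int main() {
    auto counts = kmerize("ACGTACGTGG", 4);
    for (const auto& kv : counts)
        printf("%s : %d\n", kv.first.c_str(), kv.second);
    return 0;
}
```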
{"title":"SARVAVID: A Domain Specific Language for Developing Scalable Computational Genomics Applications","authors":"K. Mahadik, Christopher Wright, Jinyi Zhang, Milind Kulkarni, S. Bagchi, S. Chaterji","doi":"10.1145/2925426.2926283","DOIUrl":"https://doi.org/10.1145/2925426.2926283","url":null,"abstract":"Breakthroughs in gene sequencing technologies have led to an exponential increase in the amount of genomic data. Efficient tools to rapidly process such large quantities of data are critical in the study of gene functions, diseases, evolution, and population variation. These tools are designed in an ad-hoc manner, and require extensive programmer effort to develop and optimize them. Often, such tools are written with the currently available data sizes in mind, and soon start to under perform due to the exponential growth in data. Furthermore, to obtain high-performance, these tools require parallel implementations, adding to the development complexity. This paper makes an observation that most such tools contain a recurring set of software modules, or kernels. The availability of efficient implementations of such kernels can improve programmer productivity, and provide effective scalability with growing data. To achieve this goal, the paper presents a domain-specific language, called Sarvavid, which provides these kernels as language constructs. Sarvavid comes with a compiler that performs domain-specific optimizations, which are beyond the scope of libraries and generic compilers. Furthermore, Sarvavid inherently supports exploitation of parallelism across multiple nodes. To demonstrate the efficacy of Sarvavid, we implement five well-known genomics applications---BLAST, MUMmer, E-MEM, SPAdes, and SGA---using Sarvavid. Our versions of BLAST, MUMmer, and E-MEM show a speedup of 2.4X, 2.5X, and 2.1X respectively compared to hand-optimized implementations when run on a single node, while SPAdes and SGA show the same performance as hand-written code. Moreover, Sarvavid applications scale to 1024 cores using a Hadoop backend.","PeriodicalId":422112,"journal":{"name":"Proceedings of the 2016 International Conference on Supercomputing","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126412427","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ang Li, S. Song, M. Wijtvliet, Akash Kumar, H. Corporaal
Approximate computing, a technique that sacrifices a certain amount of accuracy in exchange for substantial performance gains or power reduction, is one of the most promising solutions for enabling power control and performance scaling towards exascale. Although most existing approximation designs target emerging data-intensive applications that are comparatively error-tolerant, there is still high demand for accelerating traditional scientific applications (e.g., weather and nuclear simulation), which often comprise intensive transcendental function calls and are very sensitive to accuracy loss. To address this challenge, we focus on a very important but long-ignored approximation unit on today's commercial GPUs---the special-function unit (SFU)---and clarify its unique role in accelerating accuracy-sensitive applications in the context of approximate computing. To better understand its features, we conduct a thorough empirical analysis on three generations of NVIDIA GPU architectures to evaluate all the single-precision and double-precision numeric transcendental functions that can be accelerated by SFUs, in terms of their performance, accuracy, and power consumption. Based on the insights from this evaluation, we propose a transparent, tractable, and portable design framework for SFU-driven approximate acceleration on GPUs. Our design is software-based and requires no hardware or application modifications. Experimental results on three NVIDIA GPU platforms demonstrate that the proposed framework provides fine-grained tuning of the performance-accuracy trade-off, enabling applications to achieve the maximum performance under given accuracy constraints.
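On NVIDIA GPUs the SFU path is reachable from CUDA through single-precision intrinsics such as __sinf, __expf, and __logf (or globally via nvcc's --use_fast_math). The kernels below contrast the precise sinf path with the SFU-approximated __sinf path and measure the resulting error; this small harness is our own illustration of the accuracy/performance trade-off being exploited, not the paper's tuning framework.

```cuda
// Precise math-library path vs. SFU intrinsic path on the GPU. __sinf maps to
// the special-function unit and trades accuracy for throughput; sinf uses the
// accurate software sequence. Error measurement here is purely illustrative.
#include <cstdio>
#include <cmath>

__global__ void sin_precise(const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = sinf(x[i]);        // accurate software sequence on the CUDA cores
}

__global__ void sin_sfu(const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = __sinf(x[i]);      // single SFU instruction, lower accuracy
}

int main() {
    const int n = 1 << 20;
    float *x, *y0, *y1;
    cudaMallocManaged(&x,  n * sizeof(float));
    cudaMallocManaged(&y0, n * sizeof(float));
    cudaMallocManaged(&y1, n * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = 6.2831853f * i / n;

    sin_precise<<<(n + 255) / 256, 256>>>(x, y0, n);
    sin_sfu    <<<(n + 255) / 256, 256>>>(x, y1, n);
    cudaDeviceSynchronize();

    double max_err = 0.0;
    for (int i = 0; i < n; ++i)
        max_err = fmax(max_err, fabs((double)y0[i] - (double)y1[i]));
    printf("max |precise - SFU| = %g\n", max_err);

    cudaFree(x); cudaFree(y0); cudaFree(y1);
    return 0;
}
```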
{"title":"SFU-Driven Transparent Approximation Acceleration on GPUs","authors":"Ang Li, S. Song, M. Wijtvliet, Akash Kumar, H. Corporaal","doi":"10.1145/2925426.2926255","DOIUrl":"https://doi.org/10.1145/2925426.2926255","url":null,"abstract":"Approximate computing, the technique that sacrifices certain amount of accuracy in exchange for substantial performance boost or power reduction, is one of the most promising solutions to enable power control and performance scaling towards exascale. Although most existing approximation designs target the emerging data-intensive applications that are comparatively more error-tolerable, there is still high demand for the acceleration of traditional scientific applications (e.g., weather and nuclear simulation), which often comprise intensive transcendental function calls and are very sensitive to accuracy loss. To address this challenge, we focus on a very important but long ignored approximation unit on today's commercial GPUs --- the special-function unit (SFU), and clarify its unique role in performance acceleration of accuracy-sensitive applications in the context of approximate computing. To better understand its features, we conduct a thorough empirical analysis on three generations of NVIDIA GPU architectures to evaluate all the single-precision and double-precision numeric transcendental functions that can be accelerated by SFUs, in terms of their performance, accuracy and power consumption. Based on the insights from the evaluation, we propose a transparent, tractable and portable design framework for SFU-driven approximate acceleration on GPUs. Our design is software-based and requires no hardware or application modifications. Experimental results on three NVIDIA GPU platforms demonstrate that our proposed framework can provide fine-grained tuning for performance and accuracy trade-offs, thus facilitating applications to achieve the maximum performance under certain accuracy constraints.","PeriodicalId":422112,"journal":{"name":"Proceedings of the 2016 International Conference on Supercomputing","volume":"251 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125711365","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Searches on large graphs are heavily memory-latency bound, as a result of many high-latency DRAM accesses. Due to the highly irregular nature of the access patterns involved, caches and prefetchers, both hardware and software, perform poorly on graph workloads. This leads to the CPU stalling for the majority of the time. However, in many cases the data access pattern is well defined and predictable in advance, with many traversals falling into a small set of simple patterns. Although existing implicit prefetchers cannot bring significant benefit, a prefetcher armed with knowledge of the data structures and access patterns could accurately anticipate an application's traversals and bring in the appropriate data. This paper presents the design of an explicitly configured prefetcher that improves performance for breadth-first searches and sequential iteration on the efficient and commonly used compressed sparse row (CSR) graph format. By snooping L1 cache accesses from the core and reacting to data returned from its own prefetches, the prefetcher can schedule timely loads of data in advance of the application needing it. For a range of applications and graph sizes, our prefetcher achieves average speedups of 2.3x, and up to 3.3x, with little impact on memory bandwidth requirements.
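The traversal the prefetcher is configured for is ordinary breadth-first search over the CSR arrays, whose chain of dependent loads (work queue, then row offsets, then edge list, then the visited/distance array) is predictable in structure yet hostile to conventional prefetchers. The sketch below is a plain software BFS included only to make that access pattern explicit; it contains none of the proposed hardware.

```cuda
// Breadth-first search over a compressed sparse row (CSR) graph. The chain of
// dependent loads -- queue -> row_offsets[v..v+1] -> col_indices[e] ->
// dist[dst] -- is exactly the well-defined but cache-unfriendly pattern the
// explicitly configured prefetcher is told to run ahead of.
#include <cstdint>
#include <vector>
#include <queue>

std::vector<int32_t> bfs_csr(const std::vector<int64_t>& row_offsets,
                             const std::vector<int32_t>& col_indices,
                             int32_t source) {
    std::vector<int32_t> dist(row_offsets.size() - 1, -1);
    std::queue<int32_t> work;
    dist[source] = 0;
    work.push(source);
    while (!work.empty()) {
        int32_t v = work.front(); work.pop();                            // 1: queue read
        for (int64_t e = row_offsets[v]; e < row_offsets[v + 1]; ++e) {  // 2: row offsets
            int32_t dst = col_indices[e];                                // 3: edge list
            if (dist[dst] == -1) {                                       // 4: random access
                dist[dst] = dist[v] + 1;
                work.push(dst);
            }
        }
    }
    return dist;
}

int main() {
    // Tiny 4-vertex example: edges 0->1, 0->2, 1->3, 2->3.
    std::vector<int64_t> row_offsets = {0, 2, 3, 4, 4};
    std::vector<int32_t> col_indices = {1, 2, 3, 3};
    auto dist = bfs_csr(row_offsets, col_indices, 0);
    return dist[3] == 2 ? 0 : 1;   // vertex 3 is two hops from the source
}
```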
{"title":"Graph Prefetching Using Data Structure Knowledge","authors":"S. Ainsworth, Timothy M. Jones","doi":"10.1145/2925426.2926254","DOIUrl":"https://doi.org/10.1145/2925426.2926254","url":null,"abstract":"Searches on large graphs are heavily memory latency bound, as a result of many high latency DRAM accesses. Due to the highly irregular nature of the access patterns involved, caches and prefetchers, both hardware and software, perform poorly on graph workloads. This leads to CPU stalling for the majority of the time. However, in many cases the data access pattern is well defined and predictable in advance, many falling into a small set of simple patterns. Although existing implicit prefetchers cannot bring significant benefit, a prefetcher armed with knowledge of the data structures and access patterns could accurately anticipate applications' traversals to bring in the appropriate data. This paper presents a design of an explicitly configured prefetcher to improve performance for breadth-first searches and sequential iteration on the efficient and commonly-used compressed sparse row graph format. By snooping L1 cache accesses from the core and reacting to data returned from its own prefetches, the prefetcher can schedule timely loads of data in advance of the application needing it. For a range of applications and graph sizes, our prefetcher achieves average speedups of 2.3x, and up to 3.3x, with little impact on memory bandwidth requirements.","PeriodicalId":422112,"journal":{"name":"Proceedings of the 2016 International Conference on Supercomputing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124695686","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yunlong Xu, Rui Wang, Tao Li, Mingcong Song, Lan Gao, Zhongzhi Luan, D. Qian
Due to the cost-effective, massive computational power of graphics processing units (GPUs), there is growing interest in utilizing GPUs in real-time systems. For example, GPUs have been applied to automotive systems to enable new advanced and intelligent driver-assistance technologies, accelerating the path to self-driving cars. In such systems, GPUs are shared among tasks with mixed timing constraints: real-time (RT) tasks that have to be completed before specified deadlines, and non-real-time, best-effort (BE) tasks. In this paper, (1) we propose resource-aware non-uniform slack distribution to enhance the schedulability of RT tasks (the total amount of RT work whose deadlines can be satisfied on a given amount of resources) in GPU-enabled systems; and (2) we propose deadline-aware dynamic GPU partitioning to allow RT and BE tasks to run on a GPU simultaneously, so that BE tasks are not blocked for long periods. We evaluate the effectiveness of the proposed approaches using both synthetic benchmarks and a real-world workload consisting of a set of emerging automotive tasks. Experimental results show that the proposed approaches yield significant schedulability improvements for RT tasks and turnaround-time reductions for BE tasks. Moreover, the analysis of two driving scenarios shows that these improvements can significantly enhance driving safety and experience. For example, with resource-aware non-uniform slack distribution, the distance a car travels between the moment a traffic sign (pedestrian) is "seen" and the moment it is "recognized" decreases from 44.4m to 22.2m (from 4.4m to 2.2m); with deadline-aware dynamic GPU partitioning, the distance the car travels before a drowsy driver is woken up is reduced from 56.2m to 29.2m.
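The first idea can be illustrated with a small routine that splits an RT task's end-to-end slack across its GPU kernels in proportion to each kernel's resource demand, giving heavyweight kernels looser local deadlines. The proportional weighting and the data-structure fields below are our simplification for illustration; they are not the paper's exact slack-distribution policy.

```cuda
// Non-uniform slack distribution sketch: instead of splitting an RT task's
// slack evenly across its kernels, give each kernel a share weighted by the
// GPU resources it needs. The proportional-to-demand weighting is an assumed
// simplification of the resource-aware policy, not the paper's formula.
#include <vector>
#include <cstdio>

struct Kernel {
    double wcet;        // worst-case execution time when run alone (ms)
    double resource;    // e.g., fraction of SMs/threads the kernel occupies
    double local_deadline = 0.0;
};

void distribute_slack(std::vector<Kernel>& kernels, double task_deadline) {
    double total_wcet = 0.0, total_res = 0.0;
    for (const auto& k : kernels) { total_wcet += k.wcet; total_res += k.resource; }
    double slack = task_deadline - total_wcet;       // assumed non-negative here
    double t = 0.0;
    for (auto& k : kernels) {
        t += k.wcet + slack * (k.resource / total_res);  // resource-weighted share
        k.local_deadline = t;
    }
}

int main() {
    std::vector<Kernel> kernels = {{2.0, 0.25}, {6.0, 0.75}};
    distribute_slack(kernels, 12.0);                 // 4 ms of slack to spread
    for (size_t i = 0; i < kernels.size(); ++i)
        printf("kernel %zu local deadline: %.2f ms\n", i, kernels[i].local_deadline);
    return 0;
}
```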
{"title":"Scheduling Tasks with Mixed Timing Constraints in GPU-Powered Real-Time Systems","authors":"Yunlong Xu, Rui Wang, Tao Li, Mingcong Song, Lan Gao, Zhongzhi Luan, D. Qian","doi":"10.1145/2925426.2926265","DOIUrl":"https://doi.org/10.1145/2925426.2926265","url":null,"abstract":"Due to the cost-effective, massive computational power of graphics processing units (GPUs), there is a growing interest of utilizing GPUs in real-time systems. For example GPUs have been applied to automotive systems to enable new advanced and intelligent driver assistance technologies, accelerating the path to self-driving cars. In such systems, GPUs are shared among tasks with mixed timing constraints: real-time (RT) tasks that have to be accomplished before specified deadlines, and non-real-time, best-effort (BE) tasks. In this paper, (1) we propose resource-aware non-uniform slack distribution to enhance the schedulability of RT tasks (the total amount of work of RT tasks whose deadlines can be satisfied on a given amount of resources) in GPU-enabled systems; (2) we propose deadline-aware dynamic GPU partitioning to allow RT and BE tasks to run on a GPU simultaneously, such that BE tasks are not blocked for a long time. We evaluate the effectiveness of the proposed approaches by using both synthetic benchmarks and a real-world workload that consists of a set of emerging automotive tasks. Experimental results show that the proposed approaches yield significant schedulability improvement for RT tasks and turnaround time decrement for BE tasks. Moreover, the analysis of two driving scenarios shows that such schedulability improvement and turnaround time decrement can significantly enhance the driving safety and experience. For example, when the resource-aware non-uniform slack distribution approach is used, the distance that a car travels during the time between a traffic sign (pedestrian) is \"seen and recognized\" is decreased from 44.4m to 22.2m (from 4.4m to 2.2m); when the deadline-aware dynamic GPU partitioning approach is used, the distance that the car has traveled before a drowsy driver is woken up is reduced from 56.2m to 29.2m.","PeriodicalId":422112,"journal":{"name":"Proceedings of the 2016 International Conference on Supercomputing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131952484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Modern GPUs employ caches to improve memory-system efficiency. However, a large amount of cache space is underutilized due to the irregular memory accesses and poor spatial locality commonly exhibited by GPU applications. Our experiments show that using smaller cache lines could improve cache space utilization, but doing so frequently incurs significant performance loss by introducing a large number of extra cache requests. In this work, we propose a novel cache design named tag-split cache (TSC) that enables fine-grained cache storage to address cache space underutilization while keeping the number of memory requests unchanged. TSC divides the tag into two parts to reduce storage overhead, and it supports multiple cache line replacements in one cycle. TSC can also automatically adjust its storage granularity to avoid performance loss for applications with good spatial locality. Our evaluation shows that TSC improves the baseline cache performance by 17.2% on average across a wide range of applications, and it also significantly outperforms previous techniques.
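One way to picture fine-grained storage with a split tag, consistent with the abstract, is a line frame that keeps a single coarse (high-order) tag shared by several small sectors plus a short fine tag per sector, so unrelated sectors can co-reside in one frame. The sketch below encodes that reading; the field widths, sector size, and replacement behavior are assumptions, not the paper's actual TSC organization.

```cuda
// One plausible organization consistent with the abstract, for illustration
// only: each frame stores one coarse tag shared by all of its sectors plus a
// short fine tag per sector, so several 32B sectors can share one 128B frame.
// Field widths and the lookup/eviction policy are assumptions, not TSC itself.
#include <cstdint>
#include <cstdio>

constexpr int SECTORS_PER_FRAME = 4;           // 128B frame split into 32B sectors

struct TagSplitFrame {
    uint64_t coarse_tag = 0;                   // shared upper tag bits
    uint8_t  fine_tag[SECTORS_PER_FRAME] = {}; // per-sector lower tag bits
    bool     sector_valid[SECTORS_PER_FRAME] = {};
};

// A request hits only if both halves of the split tag match for its sector.
bool lookup(const TagSplitFrame& f, uint64_t coarse, uint8_t fine, int sector) {
    return f.sector_valid[sector] && f.coarse_tag == coarse && f.fine_tag[sector] == fine;
}

void fill(TagSplitFrame& f, uint64_t coarse, uint8_t fine, int sector) {
    if (f.coarse_tag != coarse) {                  // frame repurposed:
        for (bool& v : f.sector_valid) v = false;  // all sectors replaced at once
        f.coarse_tag = coarse;
    }
    f.fine_tag[sector] = fine;
    f.sector_valid[sector] = true;
}

int main() {
    TagSplitFrame f;
    fill(f, /*coarse=*/0xABC, /*fine=*/3, /*sector=*/1);
    printf("hit: %d\n", lookup(f, 0xABC, 3, 1));   // 1
    printf("hit: %d\n", lookup(f, 0xABC, 3, 2));   // 0: different sector not present
    return 0;
}
```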
{"title":"Tag-Split Cache for Efficient GPGPU Cache Utilization","authors":"Lingda Li, Ari B. Hayes, S. Song, E. Zhang","doi":"10.1145/2925426.2926253","DOIUrl":"https://doi.org/10.1145/2925426.2926253","url":null,"abstract":"Modern GPUs employ cache to improve memory system efficiency. However, large amount of cache space is underutilized due to irregular memory accesses and poor spatial locality which exhibited commonly in GPU applications. Our experiments show that using smaller cache lines could improve cache space utilization, but it also frequently suffers from significant performance loss by introducing large amount of extra cache requests. In this work, we propose a novel cache design named tag-split cache (TSC) that enables fine-grained cache storage to address the problem of cache space underutilization while keeping memory request number unchanged. TSC divides tag into two parts to reduce storage overhead, and it supports multiple cache line replacement in one cycle. TSC can also automatically adjust cache storage granularity to avoid performance loss for applications with good spatial locality. Our evaluation shows that TSC improves the baseline cache performance by 17.2% on average across a wide range of applications. It also out-performs other previous techniques significantly.","PeriodicalId":422112,"journal":{"name":"Proceedings of the 2016 International Conference on Supercomputing","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125554093","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mohammad Abdel-Majeed, Daniel Wong, Justin Kuang, M. Annavaram
Graphics processing units (GPUs) are increasingly used to run a wide range of general-purpose applications. Due to wide variation in application parallelism and inherent application-level inefficiencies, GPUs experience significant idle periods. In this work, we first show that significant fine-grain pipeline bubbles exist regardless of warp scheduling policy or workload. We propose to convert these bubbles into energy-saving opportunities using Origami. Origami consists of two components: Warp Folding and the Origami scheduler. With Warp Folding, warps are split into two half-warps that are issued in succession. Warp Folding leaves half of the execution lanes idle, which is then exploited to improve energy efficiency through power gating. The Origami scheduler is a new warp scheduler that is cognizant of the Warp Folding process and tries to further extend the sleep times of idle execution lanes. By combining the two techniques, Origami can save 49% and 46% of the leakage energy in the integer and floating-point pipelines, respectively. These savings are better than, or at least on par with, Warped-Gates, a prior technique that power gates entire clusters of execution lanes. Unlike Warped-Gates, however, Origami does not rely on forcing idleness on execution lanes, which causes performance loss; hence, Origami achieves these energy savings with virtually no performance overhead.
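Warp Folding itself can be modeled at the issue stage: a 32-thread warp's active mask is split into two 16-lane halves that issue back-to-back on the lower lanes, leaving the upper lanes idle and eligible for power gating. The behavioral model below illustrates that trade (extra issue slots in exchange for long lane idle periods); it is our illustration of the mechanism described above, not the hardware design.

```cuda
// Behavioral model of Warp Folding at the issue stage: a 32-thread warp is
// issued as two consecutive 16-lane half-warps on the lower lanes, so lanes
// 16-31 stay idle and can be power gated. Illustrative model only.
#include <cstdint>
#include <cstdio>

constexpr int WARP_SIZE = 32;
constexpr int HALF      = WARP_SIZE / 2;

// Issue one warp instruction; returns the number of issue slots consumed.
int issue(uint32_t active_mask, bool folding) {
    if (!folding) {
        printf("full warp  mask=%08x  all %d lanes powered\n",
               (unsigned)active_mask, WARP_SIZE);
        return 1;
    }
    // Folded: each half executes on the low 16 lanes in successive cycles,
    // so the upper 16 lanes can sleep for the whole instruction.
    uint32_t lo =  active_mask          & 0xFFFFu;
    uint32_t hi = (active_mask >> HALF) & 0xFFFFu;
    int slots = 0;
    if (lo) { printf("lower half mask=%04x  lanes 16-31 gated\n", (unsigned)lo); ++slots; }
    if (hi) { printf("upper half mask=%04x  lanes 16-31 gated\n", (unsigned)hi); ++slots; }
    return slots ? slots : 1;   // a fully inactive warp still occupies a slot
}

int main() {
    issue(0xFFFFFFFFu, false);  // baseline: 1 slot, 32 active lanes
    issue(0xFFFFFFFFu, true);   // folded: 2 slots, but half the datapath idle
    issue(0x0000FFFFu, true);   // half-empty warp: folding costs no extra slot
    return 0;
}
```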
{"title":"Origami: Folding Warps for Energy Efficient GPUs","authors":"Mohammad Abdel-Majeed, Daniel Wong, Justin Kuang, M. Annavaram","doi":"10.1145/2925426.2926281","DOIUrl":"https://doi.org/10.1145/2925426.2926281","url":null,"abstract":"Graphical processing units (GPUs) are increasingly used to run a wide range of general purpose applications. Due to wide variation in application parallelism and inherent application level inefficiencies, GPUs experience significant idle periods. In this work, we first show that significant fine-grain pipeline bubbles exist regardless of warp scheduling policies or workloads. We propose to convert these bubbles into energy saving opportunities using Origami. Origami consists of two components: Warp Folding and the Origami scheduler. With Warp Folding, warps are split into two half-warps which are issued in succession. Warp Folding leaves half of the execution lanes idle, which is then exploited to improve energy efficiency through power gating. Origami scheduler is a new warp scheduler that is cognizant of the Warp Folding process and tries to further extend the sleep times of idle execution lanes. By combining the two techniques Origami can save 49% and 46% of the leakage energy in the integer and floating point pipelines, respectively. These savings are better than or at least on-par with Warped-Gates, a prior power gating technique that power gates the entire cluster of execution lanes. But Origami achieves these energy savings without relying on forcing idleness on execution lanes, which leads to performance losses, as has been proposed in Warped-Gates. Hence, Origami is able to achieve these energy savings with virtually no performance overhead.","PeriodicalId":422112,"journal":{"name":"Proceedings of the 2016 International Conference on Supercomputing","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133116474","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We present Roca, a technique to reduce the opportunity cost of integrating non-programmable, high-throughput accelerators in general-purpose architectures. Roca exploits the insight that non-programmable accelerators are mostly made of private local memories (PLMs), which are key to the accelerators' performance and energy efficiency. Roca transparently exposes the PLMs of otherwise unused accelerators to the cache substrate, thereby allowing the system to extract utility from accelerators even when they cannot directly speed up the system's workload. Roca adds little complexity to existing accelerator designs, requires minimal modifications to the cache substrate, and incurs a modest area overhead that is almost entirely due to additional tag storage. We quantify the utility of Roca by comparing the returns of investing area in either regular last-level cache banks or Roca-enabled accelerators. Through simulation of non-accelerated multiprogrammed workloads on a 16-core system, we extend a 2MB S-NUCA baseline to show that a 6MB Roca-enabled last-level cache built upon typical accelerators (i.e., accelerators whose area is 66% memory) can, on average, realize 70% of the performance and 68% of the energy-efficiency benefits of a same-area 8MB S-NUCA configuration, in addition to the potential orders-of-magnitude efficiency and performance improvements that the added accelerators provide to workloads suitable for acceleration.
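The lookup extension Roca implies can be sketched as a two-step probe: a request first checks its regular LLC bank and, on a miss, checks the tag array fronting an idle accelerator's borrowed PLM before going to memory, with the capacity handed back when the accelerator becomes active. The structures and reclaim policy below are illustrative assumptions based on the abstract, not the actual Roca microarchitecture.

```cuda
// Sketch of the lookup path when an idle accelerator's private local memories
// (PLMs) are lent to the last-level cache: probe the regular bank first, then
// the tag array added in front of the borrowed PLM, then fall through to DRAM.
// Structures and policy are illustrative assumptions, not Roca's design.
#include <cstdint>
#include <unordered_map>
#include <cstdio>

struct Bank {
    std::unordered_map<uint64_t, uint64_t> lines;  // tag -> data (stand-in for SRAM)
    bool lookup(uint64_t tag, uint64_t& data) const {
        auto it = lines.find(tag);
        if (it == lines.end()) return false;
        data = it->second;
        return true;
    }
};

struct RocaLLC {
    Bank regular;                 // normal LLC bank
    Bank borrowed_plm;            // accelerator PLM exposed while the unit is idle
    bool accelerator_idle = true; // when false, the PLM is reclaimed by the accelerator

    bool access(uint64_t tag, uint64_t& data) const {
        if (regular.lookup(tag, data)) return true;
        if (accelerator_idle && borrowed_plm.lookup(tag, data)) return true;
        return false;             // miss: the request would go to DRAM
    }
};

int main() {
    RocaLLC llc;
    llc.borrowed_plm.lines[0x100] = 42;      // line held in the lent PLM capacity
    uint64_t d;
    printf("hit in borrowed PLM: %d\n", llc.access(0x100, d));
    llc.accelerator_idle = false;            // accelerator reclaims its PLM
    printf("after reclaim: %d\n", llc.access(0x100, d));
    return 0;
}
```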
{"title":"Exploiting Private Local Memories to Reduce the Opportunity Cost of Accelerator Integration","authors":"E. G. Cota, Paolo Mantovani, L. Carloni","doi":"10.1145/2925426.2926258","DOIUrl":"https://doi.org/10.1145/2925426.2926258","url":null,"abstract":"We present Roca, a technique to reduce the opportunity cost of integrating non-programmable, high-throughput accelerators in general-purpose architectures. Roca exploits the insight that non-programmable accelerators are mostly made of private local memories (PLMs), which are key to the accelerators' performance and energy efficiency. Roca transparently exposes PLMs of otherwise unused accelerators to the cache substrate, thereby allowing the system to extract utility from accelerators even when they cannot directly speed up the system's workload. Roca adds low complexity to existing accelerator designs, requires minimal modifications to the cache substrate, and incurs a modest area overhead that is almost entirely due to additional tag storage. We quantify the utility of Roca by comparing the returns of investing area in either regular last-level cache banks or Roca-enabled accelerators. Through simulation of non-accelerated multiprogrammed workloads on a 16-core system, we extend a 2MB S-NUCA baseline system to show that a 6MB Roca-enabled last-level cache built upon typical accelerators (i.e. whose area is 66% memory) can, on average, realize 70% of the performance and 68% of the energy efficiency benefits of a same-area 8MB S-NUCA configuration, in addition to the potential orders-of-magnitude efficiency and performance improvements that the added accelerators provide to workloads suitable for acceleration.","PeriodicalId":422112,"journal":{"name":"Proceedings of the 2016 International Conference on Supercomputing","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133955027","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}