CAWA: Coordinated warp scheduling and Cache Prioritization for critical warp acceleration of GPGPU workloads

2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA). Pub Date: 2015-06-13. DOI: 10.1145/2749469.2750418

Shin-Ying Lee, A. Arunkumar, Carole-Jean Wu

Pages: 515-527

Citation count: 83

Abstract

The ubiquity of graphics processing unit (GPU) architectures has made them efficient alternatives to chip multiprocessors for parallel workloads. GPUs achieve superior performance by using massive multi-threading and fast context switching to hide pipeline stalls and memory access latency. However, recent characterization results have shown that general-purpose GPU (GPGPU) applications commonly encounter long stall latencies that cannot be easily hidden, even with a large number of concurrent threads/warps. The result is a disparity in execution time among parallel warps that hurts overall GPU performance: the warp criticality problem. To tackle the warp criticality problem, we propose a coordinated solution, criticality-aware warp acceleration (CAWA), that efficiently manages compute and memory resources to accelerate the execution of the critical warp. Specifically, we design (1) an instruction-based and stall-based criticality predictor to identify the critical warp in a thread-block, (2) a criticality-aware warp scheduler that preferentially allocates more time resources to the critical warp, and (3) a criticality-aware cache reuse predictor that assists critical warp acceleration by retaining latency-critical and useful cache blocks in the L1 data cache. CAWA aims to remove the significant execution time disparity in order to improve resource utilization for GPGPU workloads. Our evaluation results show that, under the proposed coordinated scheduler and cache prioritization management scheme, the performance of the GPGPU workloads can be improved by 23%, while other state-of-the-art schedulers, GTO and 2-level, improve performance by 16% and -2%, respectively.
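To make components (1) and (2) concrete, here is a minimal sketch of a criticality-driven warp pick. It is a toy model under stated assumptions, not the paper's design: the abstract gives no scoring formula, so the `Warp` fields, the `criticality` score (an instruction term plus a stall term), and the tie-break on lowest warp id are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Warp:
    warp_id: int
    remaining_insts: int   # hypothetical estimate of instructions still to execute
    stall_cycles: int = 0  # hypothetical count of cycles this warp has spent stalled
    ready: bool = True     # whether the warp can issue this cycle


def criticality(warp: Warp) -> int:
    """Toy score combining an instruction-based and a stall-based term.

    Assumption for illustration only; the abstract does not state a formula.
    """
    return warp.remaining_insts + warp.stall_cycles


def pick_warp(warps):
    """Return the ready warp with the highest toy criticality score.

    Ties go to the lowest warp id. Returns None when no warp is ready.
    """
    ready = [w for w in warps if w.ready]
    if not ready:
        return None
    return max(ready, key=lambda w: (criticality(w), -w.warp_id))
```

For example, given warps with scores 10, 15, and 40 where the last one is not ready, `pick_warp` selects the warp scoring 15; once the third warp becomes ready, it is selected instead. A real scheduler would also need to update the score as instructions retire and stalls accumulate, which this sketch leaves out.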

