Barrier-Aware Warp Scheduling for Throughput Processors

Proceedings of the 2016 International Conference on Supercomputing. Pub Date: 2016-06-01. DOI: 10.1145/2925426.2926267

Yuxi Liu, Zhibin Yu, L. Eeckhout, V. Reddi, Yingwei Luo, Xiaolin Wang, Zhenlin Wang, Chengzhong Xu


Citations: 22

Abstract

Parallel GPGPU applications rely on barrier synchronization to align thread block activity. Little prior work has studied and characterized barrier synchronization within a thread block and its impact on performance. In this paper, we find that barriers cause substantial stall cycles in barrier-intensive GPGPU applications, even though GPGPUs employ lightweight hardware-supported barriers. To help investigate the reasons, we define the execution between two adjacent barriers of a thread block as a warp-phase. We find that execution progress within a warp-phase varies dramatically across warps, which we call warp-phase-divergence. While warp-phase-divergence may result from execution time disparity among warps due to differences in application code or input, and/or shared resource contention, we also show that it may result from warp scheduling. To mitigate barrier-induced stall cycles, we propose barrier-aware warp scheduling (BAWS). It combines two techniques to improve the performance of barrier-intensive GPGPU applications. The first technique, most-waiting-first (MWF), assigns a higher scheduling priority to the warps of a thread block that has a larger number of warps waiting at a barrier. The second technique, critical-fetch-first (CFF), fetches instructions from the warp to be issued by MWF in the next cycle. To evaluate the efficiency of BAWS, we consider 13 barrier-intensive GPGPU applications, and we report that BAWS improves performance by 17% and 9% on average (and up to 35% and 30%) over loosely-round-robin (LRR) and greedy-then-oldest (GTO) warp scheduling, respectively. We compare BAWS against the recent concurrent work SAWS, finding that BAWS outperforms SAWS by 7% on average and up to 27%. For non-barrier-intensive workloads, we demonstrate that BAWS is performance-neutral compared to GTO and SAWS, while improving performance by 5.7% on average (and up to 22%) compared to LRR. BAWS' hardware cost is limited to 6 bytes per streaming multiprocessor (SM).
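
The most-waiting-first rule described in the abstract can be illustrated with a small software sketch. This is not the paper's hardware design: the `Warp` record, the function names, and the tie-break by warp ID are all assumptions made for illustration, and the abstract does not say how ties are resolved. The sketch only shows the ordering idea, namely that runnable warps from the thread block with the most warps already waiting at a barrier are picked first.

```python
from dataclasses import dataclass


@dataclass
class Warp:
    warp_id: int      # hypothetical identifier, used here only for tie-breaking
    block_id: int     # thread block the warp belongs to
    at_barrier: bool  # True if the warp is stalled waiting at a barrier


def waiting_per_block(warps):
    """Count, for each thread block, how many of its warps wait at a barrier."""
    counts = {}
    for w in warps:
        counts[w.block_id] = counts.get(w.block_id, 0) + int(w.at_barrier)
    return counts


def mwf_order(warps):
    """Return the runnable warps, highest scheduling priority first.

    Warps whose thread block has more warps waiting at a barrier come first.
    Ties are broken by warp_id (an assumption, not taken from the paper).
    """
    counts = waiting_per_block(warps)
    runnable = [w for w in warps if not w.at_barrier]
    return sorted(runnable, key=lambda w: (-counts[w.block_id], w.warp_id))


# Block 0 has one warp at the barrier, block 1 has two.
warps = [
    Warp(0, block_id=0, at_barrier=True),
    Warp(1, block_id=0, at_barrier=False),
    Warp(2, block_id=0, at_barrier=False),
    Warp(3, block_id=1, at_barrier=True),
    Warp(4, block_id=1, at_barrier=True),
    Warp(5, block_id=1, at_barrier=False),
]
order = [w.warp_id for w in mwf_order(warps)]
print(order)  # [5, 1, 2]
```

In this example warp 5 is scheduled first because it is the only warp still holding back block 1, where two warps are already stalled at the barrier. Letting it reach the barrier sooner releases the waiting warps earlier, which is the intuition behind reducing barrier-induced stall cycles.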
