Warp-Consolidation: A Novel Execution Model for GPUs

Proceedings of the 2018 International Conference on Supercomputing. Pub Date: 2018-06-12. DOI: 10.1145/3205289.3205294

Ang Li, Weifeng Liu, Linnan Wang, K. Barker, S. Song


Citations: 32

Abstract

With the unprecedented growth of compute capability and memory bandwidth on modern GPUs, parallel communication and synchronization are quickly becoming a major concern for continued performance scaling. This is especially the case for emerging big-data applications. Instead of relying on a few heavily loaded CTAs that may expose opportunities for intra-CTA data reuse, current technology and design trends suggest the performance potential of allocating more lightweight CTAs that process individual tasks more independently, as the overheads from synchronization, communication and cooperation may greatly outweigh the benefits of exploiting limited data reuse in heavily loaded CTAs. This paper follows this trend and proposes a novel execution model for modern GPUs that hides the CTA execution hierarchy of the classic GPU execution model while exposing the originally hidden warp-level execution. Specifically, it relies on individual warps to undertake the original CTAs' tasks. The major observation is that significant performance gains can be achieved by replacing traditional inter-warp communication (e.g., via shared memory), cooperation (e.g., via bar primitives) and synchronization (e.g., via CTA barriers) with more efficient intra-warp communication (e.g., via register shuffling), cooperation (e.g., via warp voting) and synchronization (naturally lockstep execution) across the SIMD lanes within a warp. We analyze the pros and cons of this design and propose corresponding solutions to counter potential negative effects. Experimental results on a diverse group of thirty-two representative applications show that our proposed Warp-Consolidation execution model can achieve an average speedup of 1.7x, 2.3x, 1.5x and 1.2x (up to 6.3x, 31x, 6.4x and 3.8x) on NVIDIA Kepler (Tesla-K80), Maxwell (Tesla-M40), Pascal (Tesla-P100) and Volta (Tesla-V100) GPUs, respectively, demonstrating its applicability and portability. Our approach can be directly employed either to transform legacy code or to write new algorithms on modern commodity GPUs.
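As a rough illustration of the intra-warp communication the abstract refers to, the sketch below lets each warp sum one row of a matrix using register shuffles only, with no shared memory and no CTA barrier. It is not taken from the paper: the names `warp_reduce_sum`, `row_sum_per_warp` and `row_sums`, the one-row-per-warp work mapping and the 128-thread block size are all assumptions chosen for the example. It also assumes a 1-D block whose size is a multiple of 32 and `rows > 0`; error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kWarpSize = 32;
constexpr unsigned kFullMask = 0xffffffffu;

// Sum `val` across the 32 lanes of a warp with register shuffles.
// After the loop, lane 0 holds the warp-wide total.
__device__ float warp_reduce_sum(float val) {
    for (int offset = kWarpSize / 2; offset > 0; offset /= 2) {
        val += __shfl_down_sync(kFullMask, val, offset);
    }
    return val;
}

// Hypothetical kernel: each warp handles one row on its own, the kind of
// task a whole CTA might have handled in the classic execution model.
__global__ void row_sum_per_warp(const float* in, float* out,
                                 int rows, int cols) {
    int global_thread = blockIdx.x * blockDim.x + threadIdx.x;
    int warp_id = global_thread / kWarpSize;  // one row per warp
    int lane = threadIdx.x % kWarpSize;

    // All lanes of a warp share warp_id, so the warp exits as a unit.
    if (warp_id >= rows) return;

    // Each lane accumulates a strided slice of the row in a register.
    float partial = 0.0f;
    for (int c = lane; c < cols; c += kWarpSize) {
        partial += in[warp_id * cols + c];
    }

    // Lanes exchange partial sums through registers, not shared memory.
    float total = warp_reduce_sum(partial);
    if (lane == 0) out[warp_id] = total;
}

// Host helper (hypothetical): copies data over, launches, copies results back.
std::vector<float> row_sums(const std::vector<float>& host_in,
                            int rows, int cols) {
    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, host_in.size() * sizeof(float));
    cudaMalloc(&d_out, rows * sizeof(float));
    cudaMemcpy(d_in, host_in.data(), host_in.size() * sizeof(float),
               cudaMemcpyHostToDevice);

    const int threads = 128;  // 4 warps per block (assumed, not tuned)
    const int warps_per_block = threads / kWarpSize;
    const int blocks = (rows + warps_per_block - 1) / warps_per_block;
    row_sum_per_warp<<<blocks, threads>>>(d_in, d_out, rows, cols);

    std::vector<float> host_out(rows);
    cudaMemcpy(host_out.data(), d_out, rows * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return host_out;
}
```

Calling `row_sums(data, rows, cols)` on a row-major `rows x cols` array returns one sum per row. A CTA-level version of the same reduction would typically stage partial sums in shared memory and separate the steps with `__syncthreads()`; here the `_sync` shuffle variants carry both the data exchange and the lane convergence, which is the kind of substitution the abstract describes.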


