Warp-Consolidation: A Novel Execution Model for GPUs

Ang Li, Weifeng Liu, Linnan Wang, K. Barker, S. Song
Proceedings of the 2018 International Conference on Supercomputing, published 2018-06-12
DOI: 10.1145/3205289.3205294 (https://doi.org/10.1145/3205289.3205294)
Citations: 32

Abstract

With the unprecedented growth of compute capability and memory bandwidth on modern GPUs, parallel communication and synchronization are fast becoming a major concern for continued performance scaling. This is especially the case for emerging big-data applications. Instead of relying on a few heavily loaded CTAs (cooperative thread arrays, i.e., thread blocks) that may expose opportunities for intra-CTA data reuse, current technology and design trends point to the performance potential of allocating more lightweight CTAs that process individual tasks more independently, because the overheads of synchronization, communication and cooperation may greatly outweigh the benefits of exploiting limited data reuse in heavily loaded CTAs. This paper follows this trend and proposes a novel execution model for modern GPUs that hides the CTA execution hierarchy of the classic GPU execution model while exposing the originally hidden warp-level execution. Specifically, it relies on individual warps to undertake the original CTAs' tasks. The major observation is that significant performance gains can be achieved by replacing traditional inter-warp communication (e.g., via shared memory), cooperation (e.g., via bar primitives) and synchronization (e.g., via CTA barriers) with more efficient intra-warp communication (e.g., via register shuffling), cooperation (e.g., via warp voting) and synchronization (naturally lockstep execution) across the SIMD lanes within a warp. We analyze the pros and cons of this design and propose corresponding solutions to counter potential negative effects. Experimental results on a diverse group of thirty-two representative applications show that our proposed Warp-Consolidation execution model achieves average speedups of 1.7x, 2.3x, 1.5x and 1.2x (up to 6.3x, 31x, 6.4x and 3.8x) on NVIDIA Kepler (Tesla-K80), Maxwell (Tesla-M40), Pascal (Tesla-P100) and Volta (Tesla-V100) GPUs, respectively, demonstrating its applicability and portability. Our approach can be directly employed either to transform legacy code or to write new algorithms on modern commodity GPUs.
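
The contrast the abstract draws between inter-warp and intra-warp cooperation can be illustrated with a row-sum reduction. The sketch below is not the paper's transformation, whose details the abstract does not give; it only places the two communication styles side by side. The kernel names, the task (summing the rows of a matrix), the sizes and the helper `compare_row_sums` are assumptions made for illustration, and the `__shfl_down_sync` intrinsic assumes CUDA 9 or later.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int kWarpSize = 32;
constexpr int kCtaThreads = 128;             // baseline CTA: 4 warps (illustrative choice)
constexpr unsigned kFullMask = 0xffffffffu;  // all 32 lanes participate

// Baseline: one multi-warp CTA per row. Warps exchange partial sums
// through shared memory and synchronize with CTA-wide barriers.
__global__ void row_sum_cta(const float* in, float* out, int cols) {
    __shared__ float partial[kCtaThreads];
    const float* row = in + static_cast<size_t>(blockIdx.x) * cols;
    float acc = 0.0f;
    for (int c = threadIdx.x; c < cols; c += kCtaThreads) acc += row[c];
    partial[threadIdx.x] = acc;
    __syncthreads();
    for (int stride = kCtaThreads / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();                     // barrier after every tree level
    }
    if (threadIdx.x == 0) out[blockIdx.x] = partial[0];
}

// Warp-level variant: a single warp takes over the whole row. Lanes exchange
// partial sums through register shuffles; no shared memory, no CTA barrier.
__global__ void row_sum_warp(const float* in, float* out, int cols) {
    const float* row = in + static_cast<size_t>(blockIdx.x) * cols;
    const int lane = threadIdx.x;            // launched with exactly 32 threads per block
    float acc = 0.0f;
    for (int c = lane; c < cols; c += kWarpSize) acc += row[c];
    for (int offset = kWarpSize / 2; offset > 0; offset >>= 1)
        acc += __shfl_down_sync(kFullMask, acc, offset);
    if (lane == 0) out[blockIdx.x] = acc;    // lane 0 holds the full sum
}

// Hypothetical helper: runs both kernels and counts rows where they disagree
// with each other or with the analytically known sum.
int compare_row_sums(int rows, int cols) {
    std::vector<float> h_in(static_cast<size_t>(rows) * cols);
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            h_in[static_cast<size_t>(r) * cols + c] = static_cast<float>(r + 1);

    float *d_in = nullptr, *d_cta = nullptr, *d_warp = nullptr;
    cudaMalloc(&d_in, h_in.size() * sizeof(float));
    cudaMalloc(&d_cta, rows * sizeof(float));
    cudaMalloc(&d_warp, rows * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), h_in.size() * sizeof(float), cudaMemcpyHostToDevice);

    row_sum_cta<<<rows, kCtaThreads>>>(d_in, d_cta, cols);
    row_sum_warp<<<rows, kWarpSize>>>(d_in, d_warp, cols);

    std::vector<float> h_cta(rows), h_warp(rows);
    cudaMemcpy(h_cta.data(), d_cta, rows * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_warp.data(), d_warp, rows * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in); cudaFree(d_cta); cudaFree(d_warp);

    int mismatches = 0;
    for (int r = 0; r < rows; ++r) {
        const float expected = static_cast<float>(cols) * static_cast<float>(r + 1);
        if (h_cta[r] != expected || h_warp[r] != expected) ++mismatches;
    }
    return mismatches;
}

int main() {
    const int mismatches = compare_row_sums(64, 1000);
    std::printf("rows with mismatching sums: %d\n", mismatches);
    return mismatches != 0;
}
```

The inputs are small integers stored as floats, so both reduction orders produce exact sums and can be compared with `!=`. The sketch checks only that the two versions agree; it does not measure performance, and whether the warp-level version is faster depends on the device and the workload.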