Ang Li, Weifeng Liu, Linnan Wang, K. Barker, S. Song
Warp-Consolidation: A Novel Execution Model for GPUs
Proceedings of the 2018 International Conference on Supercomputing, published 2018-06-12
DOI: 10.1145/3205289.3205294 (https://doi.org/10.1145/3205289.3205294)
Citations: 32
Abstract
With the unprecedented growth of compute capability and memory bandwidth on modern GPUs, parallel communication and synchronization are becoming a major concern for continued performance scaling. This is especially the case for emerging big-data applications. Instead of relying on a few heavily loaded CTAs that may expose opportunities for intra-CTA data reuse, current technology and design trends suggest the performance potential of allocating more lightweight CTAs that process individual tasks more independently, as the overheads of synchronization, communication and cooperation may greatly outweigh the benefits of exploiting limited data reuse in heavily loaded CTAs. This paper follows this trend and proposes a novel execution model for modern GPUs that hides the CTA execution hierarchy of the classic GPU execution model while exposing the originally hidden warp-level execution. Specifically, it relies on individual warps to undertake the original CTAs' tasks. The major observation is that significant performance gains can be achieved by replacing traditional inter-warp communication (e.g., via shared memory), cooperation (e.g., via bar primitives) and synchronization (e.g., via CTA barriers) with more efficient intra-warp communication (e.g., via register shuffling), cooperation (e.g., via warp voting) and synchronization (naturally lockstep execution) across the SIMD lanes within a warp. We analyze the pros and cons of this design and propose corresponding solutions to counter potential negative effects. Experimental results on a diverse group of thirty-two representative applications show that our proposed Warp-Consolidation execution model can achieve an average speedup of 1.7x, 2.3x, 1.5x and 1.2x (up to 6.3x, 31x, 6.4x and 3.8x) on NVIDIA Kepler (Tesla-K80), Maxwell (Tesla-M40), Pascal (Tesla-P100) and Volta (Tesla-V100) GPUs, respectively, demonstrating its applicability and portability.
Our approach can be directly employed to either transform legacy codes or write new algorithms on modern commodity GPUs.
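
To make the register-shuffling idea in the abstract concrete, the following is a minimal CPU-side sketch in C++ that emulates how the 32 lanes of one warp could sum their values by exchanging registers, with no shared buffer and no barrier. It is not the paper's implementation. The names `Warp`, `shfl_down` and `warp_reduce_sum` are hypothetical, and `shfl_down` only imitates the data movement of CUDA's `__shfl_down_sync` with a full lane mask; on a real GPU the exchange would be a single hardware instruction per step.

```cpp
#include <array>
#include <cstddef>

// Assumption: a warp has 32 lanes, as on the NVIDIA GPUs named in the abstract.
constexpr std::size_t kWarpSize = 32;

// One register value per lane of a single warp (hypothetical helper type).
using Warp = std::array<int, kWarpSize>;

// Emulates the data movement of a "shuffle down": lane i receives the value
// held by lane i + delta. A lane whose source would fall outside the warp
// receives its own value back.
Warp shfl_down(const Warp& regs, std::size_t delta) {
    Warp out{};
    for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
        const std::size_t src = lane + delta;
        out[lane] = (src < kWarpSize) ? regs[src] : regs[lane];
    }
    return out;
}

// Tree reduction over the lanes of one warp. Each of the log2(32) = 5 steps
// halves the shuffle distance; afterwards lane 0 holds the sum of all lanes.
// The values left in the other lanes are partial results and are not used.
int warp_reduce_sum(Warp regs) {
    for (std::size_t offset = kWarpSize / 2; offset > 0; offset /= 2) {
        const Warp incoming = shfl_down(regs, offset);
        for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
            regs[lane] += incoming[lane];
        }
    }
    return regs[0];
}
```

Calling `warp_reduce_sum` on a `Warp` filled with the per-lane inputs returns the total that lane 0 would hold. The equivalent CTA-wide reduction would instead stage partial sums in shared memory and separate the steps with CTA barriers, which is the overhead the abstract argues can be avoided when one warp takes over the CTA's task.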