PERKS: a Locality-Optimized Execution Model for Iterative Memory-bound GPU Applications

Proceedings of the 37th International Conference on Supercomputing. Pub Date: 2022-04-05. DOI: 10.1145/3577193.3593705

Lingqi Zhang, M. Wahib, Peng Chen, Jintao Meng, Xiao Wang, Toshio Endo, S. Matsuoka


Citations: 1

Abstract

Iterative memory-bound solvers commonly occur in HPC codes. Typical GPU implementations have a loop on the host side that invokes the GPU kernel as many times as there are time/algorithm steps. The termination of each kernel implicitly acts as the barrier required after advancing the solution every time step. We propose an execution model for running memory-bound iterative GPU kernels: PERsistent KernelS (PERKS). In this model, the time loop is moved inside a persistent kernel, and device-wide barriers are used for synchronization. We then reduce the traffic to device memory by caching a subset of the output of each time step in the unused registers and shared memory. PERKS can be generalized to any iterative solver: they are largely independent of the solver's implementation. We explain the design principle of PERKS and demonstrate the effectiveness of PERKS for a wide range of iterative 2D/3D stencil benchmarks (geomean speedup of 2.12x for 2D stencils and 1.24x for 3D stencils over state-of-the-art libraries), and a Krylov subspace conjugate gradient solver (geomean speedup of 4.86x on smaller SpMV datasets from SuiteSparse and 1.43x on larger SpMV datasets over a state-of-the-art library). All PERKS-based implementations are available at: https://github.com/neozhang307/PERKS.
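The difference between the two execution models can be illustrated with a small CPU-side traffic model. The sketch below is not the authors' implementation and runs no GPU code. It assumes a 1D 3-point Jacobi stencil, and the function names (`step`, `host_loop`, `persistent`) and the `cached_fraction` parameter are hypothetical. It counts device-memory accesses under the simplifying assumptions that every cell costs one load and one store per step in the host-loop version, that cached cells cost nothing between the first load and the final write-back in the persistent version, and that halo exchange between cached and uncached regions is ignored.

```python
def step(u):
    """One 3-point Jacobi step; the two end cells are held fixed."""
    v = list(u)
    for i in range(1, len(u) - 1):
        v[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return v


def host_loop(u0, steps):
    """Baseline model: one kernel launch per time step.

    Kernel termination acts as the barrier, so every cell is loaded
    from and stored to device memory in every step.
    """
    u, traffic = list(u0), 0
    for _ in range(steps):
        u = step(u)
        traffic += 2 * len(u)  # n loads + n stores
    return u, traffic


def persistent(u0, steps, cached_fraction):
    """PERKS-style model: the time loop lives inside one kernel.

    A fraction of the cells is assumed to stay in registers or shared
    memory across steps (assumption: halo traffic is ignored).
    """
    n = len(u0)
    n_cached = int(n * cached_fraction)
    u = list(u0)
    traffic = n_cached  # load the cached cells once
    for _ in range(steps):
        u = step(u)  # same arithmetic; only the accounting differs
        traffic += 2 * (n - n_cached)  # uncached cells still hit device memory
        # a device-wide barrier would separate time steps here
    traffic += n_cached  # write the cached cells back once
    return u, traffic


u0 = [float(i % 7) for i in range(1000)]
base, t_base = host_loop(u0, 100)
perk, t_perk = persistent(u0, 100, 0.5)
assert base == perk          # identical numerical result
print(t_base, t_perk)        # 200000 101000
```

In this model, caching half of the domain for 100 steps roughly halves the modeled device-memory traffic. It says nothing about barrier cost or occupancy, which a real persistent kernel would also have to account for.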


