Batched Gauss-Jordan Elimination for Block-Jacobi Preconditioner Generation on GPUs

Proceedings of the 8th International Workshop on Programming Models and Applications for Multicores and Manycores. Pub Date: 2017-02-04. DOI: 10.1145/3026937.3026940

H. Anzt, J. Dongarra, Goran Flegar, E. S. Quintana‐Ortí


Citations: 26

Abstract

In this paper, we design and evaluate a routine for the efficient generation of block-Jacobi preconditioners on graphics processing units (GPUs). Concretely, to exploit the architecture of the graphics accelerator, we develop a batched Gauss-Jordan elimination CUDA kernel for matrix inversion that embeds an implicit pivoting technique and handles the entire inversion process in the GPU registers. In addition, we integrate extraction and insertion CUDA kernels to rapidly set up the block-Jacobi preconditioner. Our experiments compare the performance of our implementation against a sequence of batched routines from the MAGMA library realizing the inversion via the LU factorization with partial pivoting. Furthermore, we evaluate the costs of different strategies for the block-Jacobi extraction and insertion steps, using a variety of sparse matrices from the SuiteSparse matrix collection. Finally, we assess the efficiency of the complete block-Jacobi preconditioner generation in the context of an iterative solver applied to a set of computational science problems, and quantify its benefits over a scalar Jacobi preconditioner.
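The abstract names the core technique (batched Gauss-Jordan elimination with implicit pivoting) but gives no kernel details. As a rough illustration only, the following is a minimal sequential Python sketch of that idea, not the authors' CUDA kernel: it works on an augmented matrix in ordinary memory rather than in GPU registers, and the function names `gauss_jordan_inverse` and `batched_inverse` are hypothetical. "Implicit pivoting" is taken here to mean that rows are never physically swapped; the pivot row chosen at each step is recorded and the permutation is applied once when the result is read out.

```python
def gauss_jordan_inverse(a):
    """Invert a small dense matrix (list of rows) by Gauss-Jordan elimination.

    Pivoting is implicit: rows are never swapped. For each column k we pick
    the not-yet-used row with the largest |entry| and remember it in
    pivot_row[k]; the permutation is undone when the inverse is read out.
    """
    n = len(a)
    # Augmented working matrix [A | I] (an assumption of this sketch;
    # an in-register kernel would more likely invert in place).
    w = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(a)]
    used = [False] * n
    pivot_row = [-1] * n
    for k in range(n):
        # Choose the pivot among rows that have not served as a pivot yet.
        p = max((i for i in range(n) if not used[i]),
                key=lambda i: abs(w[i][k]))
        if w[p][k] == 0.0:
            raise ValueError("matrix is singular")
        used[p] = True
        pivot_row[k] = p
        d = w[p][k]
        w[p] = [v / d for v in w[p]]          # scale pivot row
        for i in range(n):                    # eliminate column k elsewhere
            if i != p and w[i][k] != 0.0:
                f = w[i][k]
                w[i] = [vi - f * vp for vi, vp in zip(w[i], w[p])]
    # The left half is now a permutation matrix with a 1 at (pivot_row[k], k),
    # so row k of the inverse is the right half of row pivot_row[k].
    return [w[pivot_row[k]][n:] for k in range(n)]


def batched_inverse(blocks):
    """Invert every diagonal block independently (one GPU work item each
    in a batched kernel; a plain loop in this sketch)."""
    return [gauss_jordan_inverse(b) for b in blocks]


if __name__ == "__main__":
    # The first block has a zero in the (0, 0) position, so it needs pivoting.
    blocks = [[[0, 2], [1, 1]],
              [[4, 3, 0], [3, 4, -1], [0, -1, 4]]]
    for inv in batched_inverse(blocks):
        print(inv)
```

In a block-Jacobi setting, `blocks` would hold the diagonal blocks extracted from the sparse system matrix, and the inverted blocks would be inserted back into the preconditioner; the extraction and insertion steps themselves are not sketched here.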


