Batched Gauss-Jordan Elimination for Block-Jacobi Preconditioner Generation on GPUs

H. Anzt, J. Dongarra, Goran Flegar, E. S. Quintana‐Ortí
DOI: 10.1145/3026937.3026940 (https://doi.org/10.1145/3026937.3026940)
Published in: Proceedings of the 8th International Workshop on Programming Models and Applications for Multicores and Manycores
Publication date: 2017-02-04
Citations: 26

Abstract

In this paper, we design and evaluate a routine for the efficient generation of block-Jacobi preconditioners on graphics processing units (GPUs). Concretely, to exploit the architecture of the graphics accelerator, we develop a batched Gauss-Jordan elimination CUDA kernel for matrix inversion that embeds an implicit pivoting technique and handles the entire inversion process in the GPU registers. In addition, we integrate extraction and insertion CUDA kernels to rapidly set up the block-Jacobi preconditioner. Our experiments compare the performance of our implementation against a sequence of batched routines from the MAGMA library realizing the inversion via the LU factorization with partial pivoting. Furthermore, we evaluate the costs of different strategies for the block-Jacobi extraction and insertion steps, using a variety of sparse matrices from the SuiteSparse matrix collection. Finally, we assess the efficiency of the complete block-Jacobi preconditioner generation in the context of an iterative solver applied to a set of computational science problems, and quantify its benefits over a scalar Jacobi preconditioner.
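
The inversion step described in the abstract can be illustrated with a small CPU sketch. This is not the authors' CUDA kernel: it is a minimal pure-Python sketch of Gauss-Jordan elimination with implicit pivoting, in which the pivot row is recorded rather than physically swapped and the row permutation is resolved only when the result is read out. The function names `gauss_jordan_inverse` and `batched_inverse` are hypothetical, and the augmented-matrix layout is an assumption made for readability; the kernel described in the paper performs the inversion in GPU registers.

```python
def gauss_jordan_inverse(block):
    """Invert one small dense block with Gauss-Jordan elimination.

    Pivoting is implicit: rows are never swapped. Instead, the row chosen
    as pivot for column k is recorded in pivot_row[k], and the permutation
    is undone when the inverse is assembled at the end.
    """
    n = len(block)
    # Augmented system [A | I]; the right half accumulates the row operations.
    m = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(block)]
    used = [False] * n      # rows that have already served as a pivot
    pivot_row = [0] * n     # pivot_row[k] = row selected for column k

    for k in range(n):
        # Pick the largest entry in column k among rows not yet used.
        candidates = [i for i in range(n) if not used[i]]
        r = max(candidates, key=lambda i: abs(m[i][k]))
        if m[r][k] == 0.0:
            raise ValueError("block is singular")
        used[r] = True
        pivot_row[k] = r

        # Scale the pivot row so the pivot becomes 1.
        p = m[r][k]
        m[r] = [v / p for v in m[r]]

        # Eliminate column k from every other row (above and below).
        for i in range(n):
            if i != r:
                f = m[i][k]
                if f != 0.0:
                    m[i] = [a - f * b for a, b in zip(m[i], m[r])]

    # The left half is now a permutation matrix; row k of the inverse is
    # the right half of the row that was the pivot for column k.
    return [m[pivot_row[k]][n:] for k in range(n)]


def batched_inverse(blocks):
    """Invert every diagonal block independently (sequential stand-in for a batch)."""
    return [gauss_jordan_inverse(b) for b in blocks]


# Usage: the first block has a zero in its top-left corner, so it cannot be
# inverted without pivoting.
blocks = [[[0.0, 2.0], [1.0, 1.0]], [[4.0]]]
inverses = batched_inverse(blocks)
```

Because the blocks are independent, the loop in `batched_inverse` is the part that a batched GPU routine would run in parallel; the sketch executes it sequentially only to keep the arithmetic easy to follow.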