Parallel QR Factorization of Block Low-rank Matrices

M. R. Apriansyah, Rio Yokota
ACM Transactions on Mathematical Software (TOMS), pages 1-28. Journal article, published 2022-05-23.
DOI: https://doi.org/10.1145/3538647
Citations: 1

Abstract

We present two new algorithms for Householder QR factorization of Block Low-Rank (BLR) matrices: one that performs block-column-wise QR and another that is based on tiled QR. We show how the block-column-wise algorithm exploits BLR structure to achieve arithmetic complexity of 𝒪(mn), while the tiled BLR-QR exhibits 𝒪(mn^1.5) complexity. However, the tiled BLR-QR has finer task granularity that allows parallel task-based execution on shared memory systems. We compare the block-column-wise BLR-QR using fork-join parallelism with tiled BLR-QR using task-based parallelism. We also compare these two implementations of Householder BLR-QR with a block-column-wise Modified Gram–Schmidt (MGS) BLR-QR using fork-join parallelism and a state-of-the-art vendor-optimized dense Householder QR in Intel MKL. For a matrix of size 131k × 65k, all BLR methods are more than an order of magnitude faster than the dense QR in MKL. Our methods are also robust to ill conditioning and produce better orthogonal factors than the existing MGS-based method. On a CPU with 64 cores, our parallel tiled Householder and block-column-wise Householder algorithms show a speedup of 50 and 37 times, respectively.
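
The claim about orthogonality under ill conditioning can be illustrated on a small dense matrix. The following is a minimal sketch, not the BLR algorithms of the paper: it assumes NumPy, uses `np.linalg.qr` (LAPACK Householder QR) as the Householder reference, and compares it with a textbook Modified Gram-Schmidt routine. The helper names `mgs_qr`, `ill_conditioned`, and `orthogonality_error`, the matrix size, and the condition number 1e10 are all hypothetical choices made for illustration.

```python
import numpy as np


def mgs_qr(a):
    """Textbook Modified Gram-Schmidt QR of a tall dense matrix (illustrative only)."""
    m, n = a.shape
    q = a.astype(float).copy()
    r = np.zeros((n, n))
    for j in range(n):
        # Normalize the current column.
        r[j, j] = np.linalg.norm(q[:, j])
        q[:, j] /= r[j, j]
        # Immediately remove the new direction from all remaining columns.
        r[j, j + 1:] = q[:, j] @ q[:, j + 1:]
        q[:, j + 1:] -= np.outer(q[:, j], r[j, j + 1:])
    return q, r


def ill_conditioned(m, n, cond, seed=0):
    """Build an m-by-n test matrix with prescribed 2-norm condition number (assumed setup)."""
    rng = np.random.default_rng(seed)
    u, _ = np.linalg.qr(rng.standard_normal((m, n)))
    v, _ = np.linalg.qr(rng.standard_normal((n, n)))
    s = np.logspace(0, -np.log10(cond), n)  # singular values from 1 down to 1/cond
    return (u * s) @ v.T


def orthogonality_error(q):
    """Frobenius norm of Q^T Q - I; zero for exactly orthonormal columns."""
    return np.linalg.norm(q.T @ q - np.eye(q.shape[1]))


a = ill_conditioned(200, 50, cond=1e10)
q_hh, r_hh = np.linalg.qr(a)   # Householder-based QR via LAPACK
q_mgs, r_mgs = mgs_qr(a)       # Modified Gram-Schmidt

err_hh = orthogonality_error(q_hh)
err_mgs = orthogonality_error(q_mgs)
print(f"Householder orthogonality error: {err_hh:.2e}")
print(f"MGS orthogonality error:         {err_mgs:.2e}")
```

Both factorizations reproduce the input matrix to working precision, but the MGS orthogonality error is expected to grow roughly in proportion to the condition number, while the Householder error stays near machine precision. This dense experiment only mirrors the qualitative behavior the abstract reports for the BLR variants.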