Parallel QR Factorization of Block Low-rank Matrices

ACM Transactions on Mathematical Software (TOMS), Pub Date: 2022-05-23, DOI: 10.1145/3538647

M. R. Apriansyah, Rio Yokota

{"title":"Parallel QR Factorization of Block Low-rank Matrices","authors":"M. R. Apriansyah, Rio Yokota","doi":"10.1145/3538647","DOIUrl":null,"url":null,"abstract":"We present two new algorithms for Householder QR factorization of Block Low-Rank (BLR) matrices: one that performs block-column-wise QR and another that is based on tiled QR. We show how the block-column-wise algorithm exploits BLR structure to achieve arithmetic complexity of 𝒪(mn), while the tiled BLR-QR exhibits 𝒪(mn1.5 complexity. However, the tiled BLR-QR has finer task granularity that allows parallel task-based execution on shared memory systems. We compare the block-column-wise BLR-QR using fork-join parallelism with tiled BLR-QR using task-based parallelism. We also compare these two implementations of Householder BLR-QR with a block-column-wise Modified Gram–Schmidt (MGS) BLR-QR using fork-join parallelism and a state-of-the-art vendor-optimized dense Householder QR in Intel MKL. For a matrix of size 131k × 65k, all BLR methods are more than an order of magnitude faster than the dense QR in MKL. Our methods are also robust to ill conditioning and produce better orthogonal factors than the existing MGS-based method. 
On a CPU with 64 cores, our parallel tiled Householder and block-column-wise Householder algorithms show a speedup of 50 and 37 times, respectively.","PeriodicalId":7036,"journal":{"name":"ACM Transactions on Mathematical Software (TOMS)","volume":"275 1","pages":"1 - 28"},"PeriodicalIF":0.0000,"publicationDate":"2022-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Mathematical Software (TOMS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3538647","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Citations: 1

Abstract

We present two new algorithms for Householder QR factorization of Block Low-Rank (BLR) matrices: one that performs block-column-wise QR and another that is based on tiled QR. We show how the block-column-wise algorithm exploits BLR structure to achieve arithmetic complexity of 𝒪(mn), while the tiled BLR-QR exhibits 𝒪(mn^{1.5}) complexity. However, the tiled BLR-QR has finer task granularity that allows parallel task-based execution on shared-memory systems. We compare the block-column-wise BLR-QR using fork-join parallelism with the tiled BLR-QR using task-based parallelism. We also compare these two implementations of Householder BLR-QR with a block-column-wise Modified Gram–Schmidt (MGS) BLR-QR using fork-join parallelism and with a state-of-the-art vendor-optimized dense Householder QR in Intel MKL. For a matrix of size 131k × 65k, all BLR methods are more than an order of magnitude faster than the dense QR in MKL. Our methods are also robust to ill-conditioning and produce better orthogonal factors than the existing MGS-based method. On a CPU with 64 cores, our parallel tiled Householder and block-column-wise Householder algorithms show speedups of 50 and 37 times, respectively.
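The claim that Householder-based QR yields better orthogonal factors than MGS on ill-conditioned input can be illustrated on a small dense matrix. The sketch below is not the paper's BLR algorithm: it is a plain, unblocked, pure-Python comparison, and the helper names (`lauchli`, `mgs`, `householder_q`, `orth_error`, `compare`) as well as the choice of a Läuchli test matrix are assumptions made only for this illustration. It measures the largest entry of |QᵀQ − I| for both methods.

```python
import math


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def lauchli(eps):
    """4 x 3 Lauchli matrix (rows); nearly rank-deficient for small eps."""
    return [[1.0, 1.0, 1.0],
            [eps, 0.0, 0.0],
            [0.0, eps, 0.0],
            [0.0, 0.0, eps]]


def mgs(A):
    """Modified Gram-Schmidt; returns the orthogonal factor as a list of columns."""
    Q = []
    for col in zip(*A):
        v = list(col)
        for q in Q:
            r = dot(q, v)  # coefficient uses the already-updated v (the "modified" part)
            v = [vi - r * qi for vi, qi in zip(v, q)]
        nrm = math.sqrt(dot(v, v))
        Q.append([vi / nrm for vi in v])
    return Q


def householder_q(A):
    """Householder QR; returns the thin orthogonal factor as a list of columns."""
    m, n = len(A), len(A[0])
    R = [row[:] for row in A]
    Q = [[float(i == j) for j in range(m)] for i in range(m)]
    for k in range(n):
        x = [R[i][k] for i in range(k, m)]
        alpha = -math.copysign(math.sqrt(dot(x, x)), x[0])
        v = x[:]
        v[0] -= alpha  # v = x - alpha * e1, sign chosen to avoid cancellation
        vv = dot(v, v)
        if vv == 0.0:
            continue  # column is already zero below the diagonal
        # R <- H R with H = I - 2 v v^T / (v^T v), acting on rows k..m-1
        for j in range(k, n):
            s = 2.0 * sum(v[i] * R[k + i][j] for i in range(m - k)) / vv
            for i in range(m - k):
                R[k + i][j] -= s * v[i]
        # Q <- Q H, acting on columns k..m-1
        for i in range(m):
            s = 2.0 * sum(Q[i][k + j] * v[j] for j in range(m - k)) / vv
            for j in range(m - k):
                Q[i][k + j] -= s * v[j]
    return [[Q[i][j] for i in range(m)] for j in range(n)]


def orth_error(Qcols):
    """Largest absolute entry of Q^T Q - I."""
    n = len(Qcols)
    return max(abs(dot(Qcols[i], Qcols[j]) - (1.0 if i == j else 0.0))
               for i in range(n) for j in range(n))


def compare(eps=1e-8):
    """Return (MGS error, Householder error) for the Lauchli matrix."""
    A = lauchli(eps)
    return orth_error(mgs(A)), orth_error(householder_q(A))


if __name__ == "__main__":
    e_mgs, e_hh = compare()
    print(f"MGS orthogonality error:         {e_mgs:.2e}")
    print(f"Householder orthogonality error: {e_hh:.2e}")
```

In IEEE double precision, the MGS error on this matrix should come out many orders of magnitude above the Householder error, which stays close to machine precision. This matches the standard result that the loss of orthogonality in MGS grows with the condition number, whereas Householder reflections do not have that dependence.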


