Communication-Avoiding QR Decomposition for GPUs

2011 IEEE International Parallel & Distributed Processing Symposium. Pub Date: 2011-05-16. DOI: 10.1109/IPDPS.2011.15

Michael J. Anderson, Grey Ballard, J. Demmel, K. Keutzer


Citations: 101

Abstract

We describe an implementation of the Communication-Avoiding QR (CAQR) factorization that runs entirely on a single graphics processor (GPU). We show that the reduction in memory traffic provided by CAQR allows us to outperform existing parallel GPU implementations of QR for a large class of tall-skinny matrices. Other GPU implementations of QR handle panel factorizations by either sending the work to a general-purpose processor or using entirely bandwidth-bound operations, incurring data transfer overheads. In contrast, our QR is done entirely on the GPU using compute-bound kernels, meaning performance is good regardless of the width of the matrix. As a result, we outperform CULA, a parallel linear algebra library for GPUs, by up to 17x for tall-skinny matrices and Intel's Math Kernel Library (MKL) by up to 12x. We also discuss stationary video background subtraction as a motivating application. We apply a recent statistical approach, which requires many iterations of computing the singular value decomposition of a tall-skinny matrix. Using CAQR as a first step in computing the singular value decomposition, we are able to get the answer 3x faster than if we use a traditional bandwidth-bound GPU QR factorization tuned specifically for that matrix size, and 30x faster than if we use Intel's Math Kernel Library (MKL) singular value decomposition routine on a multicore CPU.
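The abstract does not spell out the algorithm, so the following is only a minimal sketch of the idea behind it: a tall-skinny QR (TSQR) reduction, the building block commonly used for panels in CAQR, followed by a singular value computation on the small R factor. It assumes NumPy on a CPU in place of the authors' GPU kernels and a single flat reduction step rather than a multi-level tree. The names `tsqr_r`, `singular_values_via_qr`, and `num_blocks` are hypothetical and do not come from the paper.

```python
import numpy as np


def tsqr_r(A, num_blocks=4):
    """Return the R factor of a tall-skinny matrix A using one TSQR reduction step."""
    # Split the rows into independent blocks. On a GPU, each block could be
    # factored by a separate thread block without touching the others.
    blocks = np.array_split(A, num_blocks, axis=0)
    # Factor each block locally and keep only its small triangular factor.
    local_rs = [np.linalg.qr(block, mode="r") for block in blocks]
    # Stack the small factors and factor once more. The result is an R factor
    # of A (unique up to the signs of its rows).
    return np.linalg.qr(np.vstack(local_rs), mode="r")


def singular_values_via_qr(A, num_blocks=4):
    """Singular values of A, computed from the small n-by-n factor R."""
    # If A = Q R with orthonormal Q, then A and R share their singular values,
    # so the SVD only has to run on an n-by-n matrix.
    R = tsqr_r(A, num_blocks)
    return np.linalg.svd(R, compute_uv=False)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((4096, 16))  # tall-skinny test matrix
    s_qr = singular_values_via_qr(A)
    s_ref = np.linalg.svd(A, compute_uv=False)
    print("max singular value difference:", np.max(np.abs(s_qr - s_ref)))
```

The point of the sketch is the data flow: each row block is read once and reduced to an n-by-n triangle, so the amount of data exchanged between blocks depends on the matrix width rather than its height.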

