A Class of Hybrid LAPACK Algorithms for Multicore and GPU Architectures

2011 Symposium on Application Accelerators in High-Performance Computing. Pub Date: 2011-07-19. DOI: 10.1109/SAAHPC.2011.18

Mitchel D. Horton, S. Tomov, J. Dongarra


Citations: 39

Abstract

Three of the top four supercomputers on the November 2010 TOP500 list of the world's most powerful supercomputers use NVIDIA GPUs to accelerate computations. Ninety-five systems on the list use processors with six or more cores, 365 use quad-core processors, and 37 use dual-core processors. The large-scale effort to enable hybrid graphics processing unit (GPU)-based multicore platforms for computational science, by developing fundamental numerical libraries for them (in particular, libraries for dense linear algebra), has been underway for some time. We present a class of algorithms based largely on software infrastructures that have already been developed for homogeneous multicores and hybrid GPU-based computing. The algorithms extend what is currently available in the Matrix Algebra for GPU and Multicore Architectures (MAGMA) library for performing Cholesky, QR, and LU factorizations using a single core or socket and a single GPU. The extensions occur in two areas. First, panels that were factored on the CPU using LAPACK are instead factored in parallel on some number of CPU cores, using a highly optimized, dynamically and asynchronously scheduled algorithm. Second, the remaining CPU cores are used to update the rightmost panels of the matrix in parallel.
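The division of work described in the abstract can be illustrated with a small, purely didactic sketch of a blocked right-looking Cholesky factorization. It is not MAGMA or LAPACK code, and every name in it (`hybrid_cholesky`, `factor_panel`, `update_columns`, `nb`, `cpu_cols`) is hypothetical. The "GPU" share of the trailing update is simulated by an ordinary Python thread, the panel is factored serially rather than by a dynamically scheduled multicore algorithm, and the two shares are synchronized after every panel, which a tuned implementation would likely avoid.

```python
from concurrent.futures import ThreadPoolExecutor
import math


def factor_panel(a, k, kb):
    """Factor columns k..k+kb-1 (rows k..n-1) in place.

    Stand-in for the panel factorization that the abstract assigns
    to a group of CPU cores; here it simply runs serially.
    """
    n = len(a)
    for j in range(k, k + kb):
        d = a[j][j] - sum(a[j][p] ** 2 for p in range(k, j))
        a[j][j] = math.sqrt(d)
        for i in range(j + 1, n):
            s = sum(a[i][p] * a[j][p] for p in range(k, j))
            a[i][j] = (a[i][j] - s) / a[j][j]


def update_columns(a, k, kb, c0, c1):
    """Apply the panel's update to trailing columns c0..c1-1 (lower triangle)."""
    n = len(a)
    for c in range(c0, c1):
        for i in range(c, n):
            a[i][c] -= sum(a[i][p] * a[c][p] for p in range(k, k + kb))


def hybrid_cholesky(a, nb, cpu_cols):
    """Return lower-triangular L with L * L^T = a (a symmetric positive definite).

    nb       : panel width (assumed parameter)
    cpu_cols : number of rightmost columns updated by the "CPU" share;
               the remaining trailing columns go to the simulated "GPU" share
    """
    n = len(a)
    a = [list(row) for row in a]  # work on a copy
    with ThreadPoolExecutor(max_workers=2) as pool:
        for k in range(0, n, nb):
            kb = min(nb, n - k)
            factor_panel(a, k, kb)
            start = k + kb
            split = max(start, n - cpu_cols)
            # The two shares write disjoint columns and only read the
            # finished panel, so they can run concurrently.
            gpu_share = pool.submit(update_columns, a, k, kb, start, split)
            cpu_share = pool.submit(update_columns, a, k, kb, split, n)
            gpu_share.result()
            cpu_share.result()
    # Keep only the lower triangle, which holds L.
    return [[a[i][j] if j <= i else 0.0 for j in range(n)] for i in range(n)]
```

For example, calling `hybrid_cholesky(A, nb=2, cpu_cols=1)` on a 3-by-3 symmetric positive definite `A` factors a two-column panel first and leaves the single rightmost column to the "CPU" share. Setting `cpu_cols=0` sends the whole trailing update to the simulated "GPU" share, which mirrors the single-GPU baseline that the abstract says is being extended.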


