Improving the Performance of CA-GMRES on Multicores with Multiple GPUs

I. Yamazaki, H. Anzt, S. Tomov, M. Hoemmen, J. Dongarra
{"title":"Improving the Performance of CA-GMRES on Multicores with Multiple GPUs","authors":"I. Yamazaki, H. Anzt, S. Tomov, M. Hoemmen, J. Dongarra","doi":"10.1109/IPDPS.2014.48","DOIUrl":null,"url":null,"abstract":"The Generalized Minimum Residual (GMRES) method is one of the most widely-used iterative methods for solving nonsymmetric linear systems of equations. In recent years, techniques to avoid communication in GMRES have gained attention because in comparison to floating-point operations, communication is becoming increasingly expensive on modern computers. Since graphics processing units (GPUs) are now becoming crucial component in computing, we investigate the effectiveness of these techniques on multicore CPUs with multiple GPUs. While we present the detailed performance studies of a matrix powers kernel on multiple GPUs, we particularly focus on orthogonalization strategies that have a great impact on both the numerical stability and performance of GMRES, especially as the matrix becomes sparser or ill-conditioned. We present the experimental results on two eight-core Intel Sandy Bridge CPUs with three NDIVIA Fermi GPUs and demonstrate that significant speedups can be obtained by avoiding communication, either on a GPU or between the GPUs. As part of our study, we investigate several optimization techniques for the GPU kernels that can also be used in other iterative solvers besides GMRES. Hence, our studies not only emphasize the importance of avoiding communication on GPUs, but they also provide insight about the effects of these optimization techniques on the performance of the sparse solvers, and may have greater impact beyond GMRES.","PeriodicalId":309291,"journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"50","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPS.2014.48","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 50

Abstract

The Generalized Minimum Residual (GMRES) method is one of the most widely used iterative methods for solving nonsymmetric linear systems of equations. In recent years, techniques to avoid communication in GMRES have gained attention because, in comparison to floating-point operations, communication is becoming increasingly expensive on modern computers. Since graphics processing units (GPUs) are now becoming a crucial component in computing, we investigate the effectiveness of these techniques on multicore CPUs with multiple GPUs. While we present detailed performance studies of a matrix powers kernel on multiple GPUs, we particularly focus on orthogonalization strategies, which have a great impact on both the numerical stability and the performance of GMRES, especially as the matrix becomes sparser or more ill-conditioned. We present experimental results on two eight-core Intel Sandy Bridge CPUs with three NVIDIA Fermi GPUs and demonstrate that significant speedups can be obtained by avoiding communication, either on a GPU or between GPUs. As part of our study, we investigate several optimization techniques for the GPU kernels that can also be used in iterative solvers other than GMRES. Hence, our studies not only emphasize the importance of avoiding communication on GPUs, but also provide insight into the effects of these optimization techniques on the performance of sparse solvers, and may have impact beyond GMRES.
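For readers unfamiliar with the communication-avoiding ingredients mentioned in the abstract, the sketch below illustrates, in plain NumPy/SciPy, the two building blocks at the heart of CA-GMRES: a matrix powers kernel that generates s Krylov basis vectors before any orthogonalization, and a Cholesky-QR-style block orthogonalization that replaces many inner products with a single reduction. This is only a conceptual sketch under our own assumptions (the names matrix_powers, cholqr, and s are illustrative), not the authors' multi-GPU implementation, which restructures communication and applies GPU-specific optimizations.

```python
# Conceptual sketch of the CA-GMRES building blocks; NOT the authors' GPU code.
import numpy as np
import scipy.sparse as sp

def matrix_powers(A, v, s):
    """Return the Krylov basis [v, Av, A^2 v, ..., A^s v] as columns.

    In CA-GMRES this kernel is restructured so that all s sparse matrix-vector
    products are done with a single round of inter-GPU (or inter-node)
    communication, e.g. by gathering the needed "ghost" rows of A up front.
    Here we only show the mathematical operation, computed naively.
    """
    V = np.empty((A.shape[0], s + 1))
    V[:, 0] = v
    for j in range(s):
        V[:, j + 1] = A @ V[:, j]          # one SpMV per basis vector
    return V

def cholqr(V):
    """Cholesky QR: a communication-avoiding tall-skinny orthogonalization.

    One block dot-product (V^T V) and one triangular solve replace the many
    inner products of modified Gram-Schmidt, but accuracy degrades when V is
    ill-conditioned, which is why orthogonalization strategy matters.
    """
    G = V.T @ V                             # single reduction: small Gram matrix
    R = np.linalg.cholesky(G).T             # G = R^T R
    Q = np.linalg.solve(R.T, V.T).T         # Q = V R^{-1}
    return Q, R

if __name__ == "__main__":
    n, s = 1000, 5
    A = sp.random(n, n, density=0.01, format="csr") + sp.identity(n)
    v = np.random.rand(n)
    v /= np.linalg.norm(v)
    V = matrix_powers(A, v, s)
    Q, R = cholqr(V)
    print("orthogonality error:", np.linalg.norm(Q.T @ Q - np.eye(s + 1)))
```

A restart length of s trades fewer communication phases against a potentially worse-conditioned basis, which is exactly the stability/performance trade-off the paper studies.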