QR Factorization on a Multicore Node Enhanced with Multiple GPU Accelerators

E. Agullo, C. Augonnet, J. Dongarra, Mathieu Faverge, H. Ltaief, Samuel Thibault, S. Tomov
Published in: 2011 IEEE International Parallel & Distributed Processing Symposium, 2011-05-16
DOI: 10.1109/IPDPS.2011.90 (https://doi.org/10.1109/IPDPS.2011.90)
Citations: 117

Abstract

One of the major trends in the design of exascale architectures is the use of multicore nodes enhanced with GPU accelerators. Exploiting all resources of a hybrid accelerator-based node at their maximum potential is thus a fundamental step towards exascale computing. In this article, we present the design of a highly efficient QR factorization for such a node. Our method has three steps. The first step expresses the QR factorization as a sequence of tasks of well-chosen granularity, each intended to run on a CPU core or a GPU. We show that high-level algorithms from the literature, initially designed for homogeneous multicore architectures, can be adapted efficiently. The second step designs the kernels that implement each individual task. We use CPU kernels from previous work and present new GPU kernels that complement those already available in the MAGMA library. We show the impact of these GPU kernels on performance and, in particular, the benefits of new hybrid CPU/GPU kernels. The last step schedules these tasks on the computational units. We present two alternative approaches, based respectively on static and dynamic scheduling. With static scheduling, we exploit a priori knowledge of the schedule to perform successive optimizations leading to very high performance. We highlight, however, the lack of portability of this approach and its limitation to relatively simple algorithms on relatively homogeneous nodes. Alternatively, by relying on an efficient runtime system, StarPU, which is in charge of ensuring data availability and coherency, we can schedule more complex algorithms on complex heterogeneous nodes with much higher productivity. In this latter case, we show that we can achieve high performance in a portable way thanks to a fine interaction between the application and the runtime system. We demonstrate that the obtained performance is very close to the theoretical upper bounds that we obtained using Linear Programming.
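To make the first step more concrete, the following is a minimal sketch of how a QR factorization can be expressed as a sequence of tile-level tasks with data dependencies. It assumes the tile QR algorithm from the multicore literature; the kernel names (GEQRT, UNMQR, TSQRT, TSMQR) and the helper functions are illustrative assumptions, since the abstract does not name them. The dependency inference from read/write sets is a simplified stand-in for what a runtime system might do, not StarPU's actual API, and no numerical work is performed.

```python
from collections import Counter, defaultdict


def tile_qr_tasks(p, q):
    """Yield (task_id, reads, writes) for tile QR on a p x q grid of tiles.

    task_id is (kernel, step k, tile row, tile column). Kernel names are
    assumed from the tile QR literature, not taken from the abstract.
    Tiles listed under `writes` are accessed in read-write mode.
    """
    for k in range(min(p, q)):
        # Factor the diagonal tile of panel k.
        yield ("GEQRT", k, k, k), [], [(k, k)]
        # Apply its reflectors to the tiles on the right in row k.
        for n in range(k + 1, q):
            yield ("UNMQR", k, k, n), [(k, k)], [(k, n)]
        for m in range(k + 1, p):
            # Eliminate tile (m, k) against the diagonal tile.
            yield ("TSQRT", k, m, k), [], [(k, k), (m, k)]
            # Update the trailing tiles in rows k and m.
            for n in range(k + 1, q):
                yield ("TSMQR", k, m, n), [(m, k)], [(k, n), (m, n)]


def build_dag(tasks):
    """Infer predecessors from the sequential submission order.

    A read waits for the last writer of the tile; a write also waits for
    every reader since that writer. Tracking whole tiles is conservative:
    it orders TSQRT after the UNMQR tasks that read the same diagonal tile.
    """
    last_writer = {}
    readers = defaultdict(list)
    deps = {}
    for task, reads, writes in tasks:
        pred = set()
        for tile in reads:
            if tile in last_writer:
                pred.add(last_writer[tile])
        for tile in writes:
            if tile in last_writer:
                pred.add(last_writer[tile])
            pred.update(readers[tile])
        for tile in reads:
            readers[tile].append(task)
        for tile in writes:
            last_writer[tile] = task
            readers[tile] = []
        deps[task] = pred
    return deps


def depth_levels(deps):
    """Group tasks by depth in the DAG; tasks in one level are independent."""
    depth = {}
    for task in deps:  # dict order is submission order: predecessors come first
        depth[task] = 1 + max((depth[p] for p in deps[task]), default=-1)
    levels = defaultdict(list)
    for task, d in depth.items():
        levels[d].append(task)
    return [levels[d] for d in sorted(levels)]


if __name__ == "__main__":
    tasks = list(tile_qr_tasks(3, 3))
    print(Counter(task[0] for task, _, _ in tasks))
    for level, group in enumerate(depth_levels(build_dag(tasks))):
        print(level, [task[0] for task in group])
```

Each level lists tasks with no dependencies among them, which is the parallelism a static or dynamic scheduler could distribute across CPU cores and GPUs. A smaller tile size yields more tasks and more parallelism, at the cost of less work per task, which is the granularity trade-off the abstract refers to.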