QR Factorization on a Multicore Node Enhanced with Multiple GPU Accelerators

2011 IEEE International Parallel & Distributed Processing Symposium. Pub Date: 2011-05-16. DOI: 10.1109/IPDPS.2011.90

E. Agullo, C. Augonnet, J. Dongarra, Mathieu Faverge, H. Ltaief, Samuel Thibault, S. Tomov

Citations: 117

Abstract

One of the major trends in the design of exascale architectures is the use of multicore nodes enhanced with GPU accelerators. Exploiting all resources of a hybrid accelerator-based node at their maximum potential is thus a fundamental step towards exascale computing. In this article, we present the design of a highly efficient QR factorization for such a node. Our method proceeds in three steps. The first step consists of expressing the QR factorization as a sequence of tasks of well-chosen granularity, each intended to run on a CPU core or a GPU. We show that we can efficiently adapt high-level algorithms from the literature that were initially designed for homogeneous multicore architectures. The second step consists of designing the kernels that implement each individual task. We use CPU kernels from previous work and present new kernels for GPUs that complement kernels already available in the MAGMA library. We show the impact of these GPU kernels on performance. In particular, we present the benefits of new hybrid CPU/GPU kernels. The last step consists of scheduling these tasks on the computational units. We present two alternative approaches, based respectively on static and dynamic scheduling. In the case of static scheduling, we exploit the a priori knowledge of the schedule to perform successive optimizations leading to very high performance. However, we highlight the lack of portability of this approach and its limitation to relatively simple algorithms on relatively homogeneous nodes. Alternatively, by relying on an efficient runtime system, StarPU, which is in charge of ensuring data availability and coherency, we can schedule more complex algorithms on complex heterogeneous nodes with much higher productivity. In this latter case, we show that we can achieve high performance in a portable way thanks to a fine interaction between the application and the runtime system. We demonstrate that the obtained performance is very close to the theoretical upper bounds that we obtained using Linear Programming.
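The first step described in the abstract, expressing the factorization as tasks whose dependencies a runtime can then schedule, can be illustrated with a minimal sketch. This is not the authors' code. It assumes the tile QR formulation from the multicore literature, with the kernel names GEQRT, UNMQR, TSQRT and TSMQR that are commonly used for it; the abstract itself does not name the kernels. The functions `tile_qr_tasks`, `infer_dependencies` and `critical_path_length` are hypothetical names introduced here. Dependencies are inferred from declared read/write accesses at whole-tile granularity, which is a simplified picture of what a task-based runtime does and may be more conservative than a real implementation.

```python
def tile_qr_tasks(nt):
    """List tile QR tasks for an nt x nt tile matrix in sequential order.

    Each task is (kernel, (k, m, n), accesses), where accesses is a list of
    (tile, mode) pairs and mode is "R" (read) or "RW" (read-write).
    The auxiliary T factors are omitted: each one is written by the same
    task that writes the matching A tile, so the task graph is unchanged.
    """
    tasks = []
    for k in range(nt):
        # Factor the diagonal tile of panel k.
        tasks.append(("GEQRT", (k, k, k), [((k, k), "RW")]))
        # Apply the resulting reflectors to the tiles right of the diagonal.
        for n in range(k + 1, nt):
            tasks.append(("UNMQR", (k, k, n), [((k, k), "R"), ((k, n), "RW")]))
        for m in range(k + 1, nt):
            # Eliminate tile (m, k) against the current triangular factor.
            tasks.append(("TSQRT", (k, m, k), [((k, k), "RW"), ((m, k), "RW")]))
            # Update the trailing tiles in rows k and m.
            for n in range(k + 1, nt):
                tasks.append(("TSMQR", (k, m, n),
                              [((m, k), "R"), ((k, n), "RW"), ((m, n), "RW")]))
    return tasks


def infer_dependencies(tasks):
    """For each task, return the set of indices of earlier tasks it must wait for."""
    last_writer = {}   # tile -> index of the last task that wrote it
    readers = {}       # tile -> tasks that read it since the last write
    deps = []
    for idx, (_, _, accesses) in enumerate(tasks):
        waits = set()
        for tile, mode in accesses:
            if tile in last_writer:
                waits.add(last_writer[tile])        # read/write after write
            if mode == "RW":
                waits.update(readers.get(tile, ()))  # write after read
        for tile, mode in accesses:
            if mode == "RW":
                last_writer[tile] = idx
                readers[tile] = []
            else:
                readers.setdefault(tile, []).append(idx)
        deps.append(waits)
    return deps


def critical_path_length(deps):
    """Length, in tasks, of the longest dependency chain."""
    depth = []
    for waits in deps:
        depth.append(1 + max((depth[j] for j in waits), default=0))
    return max(depth, default=0)


if __name__ == "__main__":
    tasks = tile_qr_tasks(3)
    deps = infer_dependencies(tasks)
    for i, (kernel, index, _) in enumerate(tasks):
        print(i, kernel, index, "waits for", sorted(deps[i]))
    print("critical path:", critical_path_length(deps))
```

A static scheduler would fix the mapping of these tasks to CPU cores and GPUs ahead of time, whereas a dynamic runtime would release each task once the tasks it waits for have completed. The tile size, which sets the task granularity, is left out of the sketch.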
