Exploiting Concurrent GPU Operations for Efficient Work Stealing on Multi-GPUs

J. F. Lima, T. Gautier, N. Maillard, Vincent Danjean
{"title":"Exploiting Concurrent GPU Operations for Efficient Work Stealing on Multi-GPUs","authors":"J. F. Lima, T. Gautier, N. Maillard, Vincent Danjean","doi":"10.1109/SBAC-PAD.2012.28","DOIUrl":null,"url":null,"abstract":"The race for Exascale computing has naturally led the current technologies to converge to multi-CPU/multi-GPU computers, based on thousands of CPUs and GPUs interconnected by PCI-Express buses or interconnection networks. To exploit this high computing power, programmers have to solve the issue of scheduling parallel programs on hybrid architectures. And, since the performance of a GPU increases at a much faster rate than the throughput of a PCI bus, data transfers must be managed efficiently by the scheduler. This paper targets multi-GPU compute nodes, where several GPUs are connected to the same machine. To overcome the data transfer limitations on such platforms, the available soft wares compute, usually before the execution, a mapping of the tasks that respects their dependencies and minimizes the global data transfers. Such an approach is too rigid and it cannot adapt the execution to possible variations of the system or to the application's load. We propose a solution that is orthogonal to the above mentioned: extensions of the Xkaapi software stack that enable to exploit full performance of a multi-GPUs system through asynchronous GPU tasks. Xkaapi schedules tasks by using a standard Work Stealing algorithm and the runtime efficiently exploits concurrent GPU operations. The runtime extensions make it possible to overlap the data transfers and the task executions on current generation of GPUs. We demonstrate that the overlapping capability is at least as important as computing a scheduling decision to reduce completion time of a parallel program. Our experiments on two dense linear algebra problems (Matrix Product and Cholesky factorization) show that our solution is highly competitive with other soft wares based on static scheduling. Moreover, we are able to sustain the peak performance (approx. 310 GFlop/s) on DGEMM, even for matrices that cannot be stored entirely in one GPU memory. With eight GPUs, we archive a speed-up of 6.74 with respect to single-GPU. The performance of our Cholesky factorization, with more complex dependencies between tasks, outperforms the state of the art single-GPU MAGMA code.","PeriodicalId":232444,"journal":{"name":"2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing","volume":"29 10","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2012-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"29","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SBAC-PAD.2012.28","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 29

Abstract

The race for Exascale computing has naturally led current technologies to converge on multi-CPU/multi-GPU computers, built from thousands of CPUs and GPUs interconnected by PCI-Express buses or interconnection networks. To exploit this computing power, programmers have to solve the problem of scheduling parallel programs on hybrid architectures. And, since the performance of a GPU increases at a much faster rate than the throughput of a PCI bus, data transfers must be managed efficiently by the scheduler. This paper targets multi-GPU compute nodes, where several GPUs are connected to the same machine. To overcome the data-transfer limitations of such platforms, the available software computes, usually before execution, a mapping of the tasks that respects their dependencies and minimizes global data transfers. Such an approach is too rigid: it cannot adapt the execution to variations of the system or of the application's load. We propose a solution that is orthogonal to the above: extensions of the Xkaapi software stack that exploit the full performance of a multi-GPU system through asynchronous GPU tasks. Xkaapi schedules tasks using a standard work-stealing algorithm, and the runtime efficiently exploits concurrent GPU operations. The runtime extensions make it possible to overlap data transfers with task executions on the current generation of GPUs. We demonstrate that this overlapping capability is at least as important as computing a scheduling decision for reducing the completion time of a parallel program. Our experiments on two dense linear algebra problems (matrix product and Cholesky factorization) show that our solution is highly competitive with other software based on static scheduling. Moreover, we are able to sustain peak performance (approx. 310 GFlop/s) on DGEMM, even for matrices that cannot be stored entirely in one GPU's memory. With eight GPUs, we achieve a speed-up of 6.74 with respect to a single GPU. The performance of our Cholesky factorization, which has more complex dependencies between tasks, outperforms the state-of-the-art single-GPU MAGMA code.
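The overlap the abstract describes is exposed by the GPU driver through asynchronous copies and concurrent streams. The following is a minimal CUDA sketch of that pattern, not the actual Xkaapi runtime: two streams let the host-to-device transfer of one tile proceed while the kernel for another tile executes. The kernel, tile size, and variable names are illustrative assumptions.

#include <cuda_runtime.h>
#include <stdio.h>

// Illustrative kernel standing in for a real task: scales a tile in place.
__global__ void scale(double *d, int n, double alpha) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= alpha;
}

int main(void) {
    const int n = 1 << 20;                 // elements per tile (assumed size)
    const size_t bytes = n * sizeof(double);

    // Pinned host memory is required for truly asynchronous copies.
    double *h0, *h1, *d0, *d1;
    cudaHostAlloc(&h0, bytes, cudaHostAllocDefault);
    cudaHostAlloc(&h1, bytes, cudaHostAllocDefault);
    cudaMalloc(&d0, bytes);
    cudaMalloc(&d1, bytes);
    for (int i = 0; i < n; ++i) { h0[i] = 1.0; h1[i] = 2.0; }

    // One stream per in-flight tile: the copy in s1 can overlap the kernel in s0.
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    cudaMemcpyAsync(d0, h0, bytes, cudaMemcpyHostToDevice, s0);
    scale<<<(n + 255) / 256, 256, 0, s0>>>(d0, n, 3.0);
    // While s0 computes on tile 0, s1 transfers tile 1.
    cudaMemcpyAsync(d1, h1, bytes, cudaMemcpyHostToDevice, s1);
    scale<<<(n + 255) / 256, 256, 0, s1>>>(d1, n, 3.0);

    cudaMemcpyAsync(h0, d0, bytes, cudaMemcpyDeviceToHost, s0);
    cudaMemcpyAsync(h1, d1, bytes, cudaMemcpyDeviceToHost, s1);
    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);
    printf("h0[0]=%f h1[0]=%f\n", h0[0], h1[0]);

    cudaStreamDestroy(s0); cudaStreamDestroy(s1);
    cudaFree(d0); cudaFree(d1);
    cudaFreeHost(h0); cudaFreeHost(h1);
    return 0;
}

A work-stealing runtime such as Xkaapi generalizes this idea: binding each asynchronous GPU task to its own stream lets the input transfer of a newly stolen task overlap the execution of the task already running.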