Exploiting Concurrent GPU Operations for Efficient Work Stealing on Multi-GPUs

J. F. Lima, T. Gautier, N. Maillard, Vincent Danjean
{"title":"Exploiting Concurrent GPU Operations for Efficient Work Stealing on Multi-GPUs","authors":"J. F. Lima, T. Gautier, N. Maillard, Vincent Danjean","doi":"10.1109/SBAC-PAD.2012.28","DOIUrl":null,"url":null,"abstract":"The race for Exascale computing has naturally led the current technologies to converge to multi-CPU/multi-GPU computers, based on thousands of CPUs and GPUs interconnected by PCI-Express buses or interconnection networks. To exploit this high computing power, programmers have to solve the issue of scheduling parallel programs on hybrid architectures. And, since the performance of a GPU increases at a much faster rate than the throughput of a PCI bus, data transfers must be managed efficiently by the scheduler. This paper targets multi-GPU compute nodes, where several GPUs are connected to the same machine. To overcome the data transfer limitations on such platforms, the available soft wares compute, usually before the execution, a mapping of the tasks that respects their dependencies and minimizes the global data transfers. Such an approach is too rigid and it cannot adapt the execution to possible variations of the system or to the application's load. We propose a solution that is orthogonal to the above mentioned: extensions of the Xkaapi software stack that enable to exploit full performance of a multi-GPUs system through asynchronous GPU tasks. Xkaapi schedules tasks by using a standard Work Stealing algorithm and the runtime efficiently exploits concurrent GPU operations. The runtime extensions make it possible to overlap the data transfers and the task executions on current generation of GPUs. We demonstrate that the overlapping capability is at least as important as computing a scheduling decision to reduce completion time of a parallel program. Our experiments on two dense linear algebra problems (Matrix Product and Cholesky factorization) show that our solution is highly competitive with other soft wares based on static scheduling. Moreover, we are able to sustain the peak performance (approx. 310 GFlop/s) on DGEMM, even for matrices that cannot be stored entirely in one GPU memory. With eight GPUs, we archive a speed-up of 6.74 with respect to single-GPU. The performance of our Cholesky factorization, with more complex dependencies between tasks, outperforms the state of the art single-GPU MAGMA code.","PeriodicalId":232444,"journal":{"name":"2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing","volume":"29 10","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2012-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"29","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SBAC-PAD.2012.28","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 29

Abstract

The race for Exascale computing has naturally led current technologies to converge on multi-CPU/multi-GPU computers, built from thousands of CPUs and GPUs interconnected by PCI-Express buses or interconnection networks. To exploit this computing power, programmers have to solve the problem of scheduling parallel programs on hybrid architectures. And, since the performance of a GPU increases at a much faster rate than the throughput of a PCI bus, data transfers must be managed efficiently by the scheduler. This paper targets multi-GPU compute nodes, where several GPUs are connected to the same machine. To overcome the data-transfer limitations of such platforms, the available software computes, usually before execution, a mapping of the tasks that respects their dependencies and minimizes global data transfers. Such an approach is too rigid: it cannot adapt the execution to variations of the system or of the application's load. We propose a solution that is orthogonal to the above: extensions of the Xkaapi software stack that exploit the full performance of a multi-GPU system through asynchronous GPU tasks. Xkaapi schedules tasks using a standard work-stealing algorithm, and the runtime efficiently exploits concurrent GPU operations. The runtime extensions make it possible to overlap data transfers with task executions on the current generation of GPUs. We demonstrate that this overlapping capability is at least as important as computing a scheduling decision for reducing the completion time of a parallel program. Our experiments on two dense linear algebra problems (matrix product and Cholesky factorization) show that our solution is highly competitive with other software based on static scheduling. Moreover, we are able to sustain peak performance (approx. 310 GFlop/s) on DGEMM, even for matrices that cannot be stored entirely in one GPU's memory. With eight GPUs, we achieve a speed-up of 6.74 with respect to a single GPU. The performance of our Cholesky factorization, which has more complex dependencies between tasks, outperforms the state-of-the-art single-GPU MAGMA code.
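The overlap the abstract describes is exposed by the GPU driver through asynchronous copies and concurrent streams. The following is a minimal CUDA sketch of that pattern, not the actual Xkaapi runtime: two streams let the host-to-device transfer of one tile proceed while the kernel for another tile executes. The kernel, tile size, and variable names are illustrative assumptions.

#include <cuda_runtime.h>
#include <stdio.h>

// Illustrative kernel standing in for a real task: scales a tile in place.
__global__ void scale(double *d, int n, double alpha) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= alpha;
}

int main(void) {
    const int n = 1 << 20;                 // elements per tile (assumed size)
    const size_t bytes = n * sizeof(double);

    // Pinned host memory is required for truly asynchronous copies.
    double *h0, *h1, *d0, *d1;
    cudaHostAlloc(&h0, bytes, cudaHostAllocDefault);
    cudaHostAlloc(&h1, bytes, cudaHostAllocDefault);
    cudaMalloc(&d0, bytes);
    cudaMalloc(&d1, bytes);
    for (int i = 0; i < n; ++i) { h0[i] = 1.0; h1[i] = 2.0; }

    // One stream per in-flight tile: the copy in s1 can overlap the kernel in s0.
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    cudaMemcpyAsync(d0, h0, bytes, cudaMemcpyHostToDevice, s0);
    scale<<<(n + 255) / 256, 256, 0, s0>>>(d0, n, 3.0);
    // While s0 computes on tile 0, s1 transfers tile 1.
    cudaMemcpyAsync(d1, h1, bytes, cudaMemcpyHostToDevice, s1);
    scale<<<(n + 255) / 256, 256, 0, s1>>>(d1, n, 3.0);

    cudaMemcpyAsync(h0, d0, bytes, cudaMemcpyDeviceToHost, s0);
    cudaMemcpyAsync(h1, d1, bytes, cudaMemcpyDeviceToHost, s1);
    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);
    printf("h0[0]=%f h1[0]=%f\n", h0[0], h1[0]);

    cudaStreamDestroy(s0); cudaStreamDestroy(s1);
    cudaFree(d0); cudaFree(d1);
    cudaFreeHost(h0); cudaFreeHost(h1);
    return 0;
}

A work-stealing runtime such as Xkaapi generalizes this idea: binding each asynchronous GPU task to its own stream lets the input transfer of a newly stolen task overlap the execution of the task already running.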