DistCL: A Framework for the Distributed Execution of OpenCL Kernels

Tahir Diop, Steven Gurfinkel, J. Anderson, Natalie D. Enright Jerger
{"title":"DistCL: A Framework for the Distributed Execution of OpenCL Kernels","authors":"Tahir Diop, Steven Gurfinkel, J. Anderson, Natalie D. Enright Jerger","doi":"10.1109/MASCOTS.2013.77","DOIUrl":null,"url":null,"abstract":"GPUs are used to speed up many scientific computations, however, to use several networked GPUs concurrently, the programmer must explicitly partition work and transmit data between devices. We propose DistCL, a novel framework that distributes the execution of penCL kernels across a GPU cluster. DistCL makes multiple distributed compute devices appear to be a single compute device. DistCL abstracts and manages many of the challenges associated with distributing a kernel across multiple devices including: (1) partitioning work into smaller parts, (2) scheduling these parts across the network, (3) partitioning memory so that each part of memory is written to by at most one device, and (4) tracking and transferring these parts of memory. Converting an OpenCL application to DistCL is straightforward and requires little programmer effort. This makes it a powerful and valuable tool for exploring the distributed execution of OpenCL kernels. We compare DistCL to SnuCL, which also facilitates the distribution of OpenCL kernels. We also give some insights: distributed tasks favor more compute bound problems and favour large contiguous memory accesses. DistCL achieves a maximum speedup of 29.1 and average speedups of 7.3 when distributing kernels among 32 peers over an Infiniband cluster.","PeriodicalId":385538,"journal":{"name":"2013 IEEE 21st International Symposium on Modelling, Analysis and Simulation of Computer and Telecommunication Systems","volume":"46 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"25","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2013 IEEE 21st International Symposium on Modelling, Analysis and Simulation of Computer and Telecommunication Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/MASCOTS.2013.77","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 25

Abstract

GPUs are used to speed up many scientific computations; however, to use several networked GPUs concurrently, the programmer must explicitly partition work and transmit data between devices. We propose DistCL, a novel framework that distributes the execution of OpenCL kernels across a GPU cluster. DistCL makes multiple distributed compute devices appear to be a single compute device. DistCL abstracts and manages many of the challenges associated with distributing a kernel across multiple devices, including: (1) partitioning work into smaller parts, (2) scheduling these parts across the network, (3) partitioning memory so that each part of memory is written to by at most one device, and (4) tracking and transferring these parts of memory. Converting an OpenCL application to DistCL is straightforward and requires little programmer effort, making it a powerful and valuable tool for exploring the distributed execution of OpenCL kernels. We compare DistCL to SnuCL, which also facilitates the distribution of OpenCL kernels. We also offer some insights: distributed tasks favor compute-bound problems and large contiguous memory accesses. DistCL achieves a maximum speedup of 29.1 and an average speedup of 7.3 when distributing kernels among 32 peers over an InfiniBand cluster.
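The abstract describes challenge (1), work partitioning, only at a high level. As a rough illustration of what splitting an OpenCL index space across peers involves, the plain-C sketch below divides a 1-D global work size into contiguous per-peer subranges. This is a minimal sketch under stated assumptions, not DistCL's actual API: the names subrange_t and partition_1d are hypothetical, and it models only the offset/size pair that OpenCL's clEnqueueNDRangeKernel accepts.

#include <stdio.h>
#include <stddef.h>

/* Hypothetical descriptor for one peer's share of a 1-D NDRange.
 * Mirrors the (global_work_offset, global_work_size) pair that
 * clEnqueueNDRangeKernel takes per launch. */
typedef struct {
    size_t offset; /* first global work-item index for this peer */
    size_t size;   /* number of work-items assigned to this peer */
} subrange_t;

/* Split global_size work-items into npeers contiguous subranges.
 * Earlier peers absorb the remainder, so sizes differ by at most one. */
static void partition_1d(size_t global_size, int npeers, subrange_t *out) {
    size_t base = global_size / npeers;
    size_t rem  = global_size % npeers;
    size_t offset = 0;
    for (int i = 0; i < npeers; i++) {
        out[i].offset = offset;
        out[i].size   = base + (i < (int)rem ? 1 : 0);
        offset += out[i].size;
    }
}

int main(void) {
    enum { NPEERS = 4 };
    subrange_t parts[NPEERS];
    partition_1d(1000000, NPEERS, parts);
    for (int i = 0; i < NPEERS; i++)
        printf("peer %d: offset=%zu size=%zu\n", i, parts[i].offset, parts[i].size);
    return 0;
}

On each peer, the subrange's offset would be passed as the global work offset of the kernel launch, so an unmodified kernel computes only its assigned slice; challenges (3) and (4) then determine which buffer regions each peer must receive before launch and which written regions must be tracked afterward.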