DistCL: A Framework for the Distributed Execution of OpenCL Kernels

Tahir Diop, Steven Gurfinkel, J. Anderson, Natalie D. Enright Jerger
{"title":"DistCL: A Framework for the Distributed Execution of OpenCL Kernels","authors":"Tahir Diop, Steven Gurfinkel, J. Anderson, Natalie D. Enright Jerger","doi":"10.1109/MASCOTS.2013.77","DOIUrl":null,"url":null,"abstract":"GPUs are used to speed up many scientific computations, however, to use several networked GPUs concurrently, the programmer must explicitly partition work and transmit data between devices. We propose DistCL, a novel framework that distributes the execution of penCL kernels across a GPU cluster. DistCL makes multiple distributed compute devices appear to be a single compute device. DistCL abstracts and manages many of the challenges associated with distributing a kernel across multiple devices including: (1) partitioning work into smaller parts, (2) scheduling these parts across the network, (3) partitioning memory so that each part of memory is written to by at most one device, and (4) tracking and transferring these parts of memory. Converting an OpenCL application to DistCL is straightforward and requires little programmer effort. This makes it a powerful and valuable tool for exploring the distributed execution of OpenCL kernels. We compare DistCL to SnuCL, which also facilitates the distribution of OpenCL kernels. We also give some insights: distributed tasks favor more compute bound problems and favour large contiguous memory accesses. DistCL achieves a maximum speedup of 29.1 and average speedups of 7.3 when distributing kernels among 32 peers over an Infiniband cluster.","PeriodicalId":385538,"journal":{"name":"2013 IEEE 21st International Symposium on Modelling, Analysis and Simulation of Computer and Telecommunication Systems","volume":"46 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"25","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2013 IEEE 21st International Symposium on Modelling, Analysis and Simulation of Computer and Telecommunication Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/MASCOTS.2013.77","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 25

Abstract

GPUs are used to speed up many scientific computations; however, to use several networked GPUs concurrently, the programmer must explicitly partition work and transmit data between devices. We propose DistCL, a novel framework that distributes the execution of OpenCL kernels across a GPU cluster. DistCL makes multiple distributed compute devices appear to be a single compute device. DistCL abstracts and manages many of the challenges associated with distributing a kernel across multiple devices, including: (1) partitioning work into smaller parts, (2) scheduling these parts across the network, (3) partitioning memory so that each part of memory is written to by at most one device, and (4) tracking and transferring these parts of memory. Converting an OpenCL application to DistCL is straightforward and requires little programmer effort, making it a powerful and valuable tool for exploring the distributed execution of OpenCL kernels. We compare DistCL to SnuCL, which also facilitates the distribution of OpenCL kernels. We also offer some insights: distributed tasks favor compute-bound problems and large contiguous memory accesses. DistCL achieves a maximum speedup of 29.1 and an average speedup of 7.3 when distributing kernels among 32 peers over an InfiniBand cluster.
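The abstract describes challenge (1), work partitioning, only at a high level. As a rough illustration of what splitting an OpenCL index space across peers involves, the plain-C sketch below divides a 1-D global work size into contiguous per-peer subranges. This is a minimal sketch under stated assumptions, not DistCL's actual API: the names subrange_t and partition_1d are hypothetical, and it models only the offset/size pair that OpenCL's clEnqueueNDRangeKernel accepts.

#include <stdio.h>
#include <stddef.h>

/* Hypothetical descriptor for one peer's share of a 1-D NDRange.
 * Mirrors the (global_work_offset, global_work_size) pair that
 * clEnqueueNDRangeKernel takes per launch. */
typedef struct {
    size_t offset; /* first global work-item index for this peer */
    size_t size;   /* number of work-items assigned to this peer */
} subrange_t;

/* Split global_size work-items into npeers contiguous subranges.
 * Earlier peers absorb the remainder, so sizes differ by at most one. */
static void partition_1d(size_t global_size, int npeers, subrange_t *out) {
    size_t base = global_size / npeers;
    size_t rem  = global_size % npeers;
    size_t offset = 0;
    for (int i = 0; i < npeers; i++) {
        out[i].offset = offset;
        out[i].size   = base + (i < (int)rem ? 1 : 0);
        offset += out[i].size;
    }
}

int main(void) {
    enum { NPEERS = 4 };
    subrange_t parts[NPEERS];
    partition_1d(1000000, NPEERS, parts);
    for (int i = 0; i < NPEERS; i++)
        printf("peer %d: offset=%zu size=%zu\n", i, parts[i].offset, parts[i].size);
    return 0;
}

On each peer, the subrange's offset would be passed as the global work offset of the kernel launch, so an unmodified kernel computes only its assigned slice; challenges (3) and (4) then determine which buffer regions each peer must receive before launch and which written regions must be tracked afterward.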