
ACM Transactions on Computer Systems (TOCS) Pub Date : 2015-08-31 DOI:10.1145/2798725

Janghaeng Lee, M. Samadi, Yongjun Park, S. Mahlke


Citations: 38


SKMD

Heterogeneous computing on CPUs and GPUs has traditionally used fixed roles for each device: the GPU handles data-parallel work by taking advantage of its massive number of cores, while the CPU handles non-data-parallel work, such as sequential code or data transfer management. This work distribution can be a poor solution because it underutilizes the CPU, has difficulty generalizing beyond a single CPU-GPU combination, and may waste a large fraction of time transferring data. Further, CPUs are performance-competitive with GPUs on many workloads, so simply partitioning work based on fixed roles may be a poor choice. In this article, we present the single-kernel multiple devices (SKMD) system, a framework that transparently orchestrates collaborative execution of a single data-parallel kernel across multiple asymmetric CPUs and GPUs. The programmer is responsible for developing a single data-parallel kernel in OpenCL, while the system automatically partitions the workload across an arbitrary set of devices, generates kernels to execute the partial workloads, and efficiently merges the partial outputs together. The goal is to improve performance by maximally utilizing all available resources to execute the kernel. SKMD handles the difficult challenges of exposed data transfer costs and the performance variations GPUs have with respect to input size. On real hardware, SKMD achieves an average speedup of 28% on a system with one multicore CPU and two asymmetric GPUs compared to a fastest-device execution strategy for a set of popular OpenCL kernels.
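The partition, execute, and merge flow described in the abstract can be illustrated with a small sketch. This is not the SKMD implementation: it is a simplified Python model that assumes each device has a single fixed throughput figure, and it ignores the data transfer costs and input-size-dependent GPU performance that the abstract says SKMD accounts for. The names `Device`, `partition_work_groups`, `run_partial`, and `merge` are hypothetical, and the "kernel" is a stand-in that squares each element.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    throughput: float  # assumed: work-groups per unit time, measured offline


def partition_work_groups(total_groups, devices):
    """Split [0, total_groups) into contiguous ranges proportional to throughput."""
    total_rate = sum(d.throughput for d in devices)
    ranges = {}
    start = 0
    for i, dev in enumerate(devices):
        if i == len(devices) - 1:
            end = total_groups  # last device absorbs any rounding remainder
        else:
            end = start + int(total_groups * dev.throughput / total_rate)
        ranges[dev.name] = (start, end)
        start = end
    return ranges


def run_partial(data, group_size, group_range):
    """Stand-in for a partial kernel: squares the elements of the assigned work-groups."""
    lo, hi = group_range
    first = lo * group_size
    last = min(hi * group_size, len(data))
    return first, [x * x for x in data[first:last]]


def merge(partials, n):
    """Copy each device's partial output into its slice of the full output."""
    out = [None] * n
    for first, values in partials:
        out[first:first + len(values)] = values
    return out


if __name__ == "__main__":
    devices = [Device("cpu", 1.0), Device("gpu0", 3.0), Device("gpu1", 4.0)]
    data = list(range(64))
    group_size = 4
    ranges = partition_work_groups(len(data) // group_size, devices)
    partials = [run_partial(data, group_size, ranges[d.name]) for d in devices]
    print(ranges)
    print(merge(partials, len(data)) == [x * x for x in data])
```

In this model each device writes a disjoint slice of the output, so merging is a plain copy. Kernels whose work-groups write to overlapping or data-dependent locations would need a more careful merge step than this sketch shows.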


Source journal

ACM Transactions on Computer Systems (TOCS)

Self-citation rate

0.00%

Number of published articles