Orchestrating Multiple Data-Parallel Kernels on Multiple Devices

2015 International Conference on Parallel Architecture and Compilation (PACT) Pub Date : 2015-10-18 DOI:10.1109/PACT.2015.14

Janghaeng Lee, M. Samadi, S. Mahlke

{"title":"Orchestrating Multiple Data-Parallel Kernels on Multiple Devices","authors":"Janghaeng Lee, M. Samadi, S. Mahlke","doi":"10.1109/PACT.2015.14","DOIUrl":null,"url":null,"abstract":"Traditionally, programmers and software tools have focused on mapping a single data-parallel kernel onto a heterogeneous computing system consisting of multiple general-purpose processors (CPUS) and graphics processing units (GPUs). These methodologies break down as application complexity grows to contain multiple communicating data-parallel kernels. This paper introduces MKMD, an automatic system for mapping multiple kernels across multiple computing devices in a seamless manner. MKMD is a two phased approach that combines coarse grain scheduling of indivisible kernels followed by opportunistic fine-grained workgroup-level partitioning to exploit idle resources. During this process, MKMD considers kernel dependencies and the underlying systems along with the execution time model built with a few sets of profile data. With the scheduling decision, MKMD transparently manages the order of executions and data transfers for each device. On a real machine with one CPU and two different GPUs, MKMD achieves a mean speedup of 1.89x compared to the in-order execution on the fastest device for a set of applications with multiple kernels. 53% of this speedup comes from the coarse-grained scheduling and the other 47% is the result of the fine-grained partitioning.","PeriodicalId":385398,"journal":{"name":"2015 International Conference on Parallel Architecture and Compilation (PACT)","volume":"114 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"30","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 International Conference on Parallel Architecture and Compilation (PACT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/PACT.2015.14","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 30

Abstract

Traditionally, programmers and software tools have focused on mapping a single data-parallel kernel onto a heterogeneous computing system consisting of multiple general-purpose processors (CPUS) and graphics processing units (GPUs). These methodologies break down as application complexity grows to contain multiple communicating data-parallel kernels. This paper introduces MKMD, an automatic system for mapping multiple kernels across multiple computing devices in a seamless manner. MKMD is a two phased approach that combines coarse grain scheduling of indivisible kernels followed by opportunistic fine-grained workgroup-level partitioning to exploit idle resources. During this process, MKMD considers kernel dependencies and the underlying systems along with the execution time model built with a few sets of profile data. With the scheduling decision, MKMD transparently manages the order of executions and data transfers for each device. On a real machine with one CPU and two different GPUs, MKMD achieves a mean speedup of 1.89x compared to the in-order execution on the fastest device for a set of applications with multiple kernels. 53% of this speedup comes from the coarse-grained scheduling and the other 47% is the result of the fine-grained partitioning.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

在多个设备上编排多个数据并行内核

传统上，程序员和软件工具关注于将单个数据并行内核映射到由多个通用处理器(cpu)和图形处理单元(gpu)组成的异构计算系统。随着应用程序复杂性的增长，这些方法会分解为包含多个通信数据并行内核。本文介绍了MKMD，一个在多个计算设备上无缝映射多个内核的自动系统。MKMD是一种分两阶段的方法，它结合了不可分割内核的粗粒度调度和机会性的细粒度工作组级分区来利用空闲资源。在此过程中，MKMD考虑内核依赖关系和底层系统，以及使用几组配置文件数据构建的执行时间模型。通过调度决策，MKMD透明地管理每个设备的执行顺序和数据传输。在一台具有一个CPU和两个不同gpu的真实机器上，对于一组具有多个内核的应用程序，与在最快的设备上按顺序执行相比，MKMD实现了1.89倍的平均加速。这种加速的53%来自粗粒度调度，另外47%是细粒度分区的结果。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

2015 International Conference on Parallel Architecture and Compilation (PACT)

自引率

0.00%

发文量

期刊最新文献

Storage Consolidation on SSDs: Not Always a Panacea, but Can We Ease the Pain? AREP: Adaptive Resource Efficient Prefetching for Maximizing Multicore Performance NVMMU: A Non-volatile Memory Management Unit for Heterogeneous GPU-SSD Architectures Scalable Task Scheduling and Synchronization Using Hierarchical Effects Scalable SIMD-Efficient Graph Processing on GPUs