Hybrid CPU-GPU scheduling and execution of tree traversals

Proceedings of the 2016 International Conference on Supercomputing. Pub Date: 2016-06-01. DOI: 10.1145/2925426.2926261

Jianqiao Liu, Nikhil Hegde, Milind Kulkarni


Citations: 22

Abstract

GPUs offer the promise of massive, power-efficient parallelism. However, exploiting this parallelism requires code to be carefully structured to deal with the limitations of the SIMT execution model. In recent years, there has been much interest in mapping irregular applications, those with unpredictable, data-dependent behavior, to GPUs. While most of the work in this space has focused on ad hoc implementations of specific algorithms, recent work has looked at generic techniques for mapping a large class of tree traversal algorithms to GPUs by carefully restructuring the traversals to make them behave more regularly. Unfortunately, even this general approach relies on ad hoc, hand-written, algorithm-specific scheduling (i.e., assignment of threads to warps) to achieve high performance. The key challenge of scheduling is that it is a highly irregular process that requires inspecting thread behavior and then carefully sorting those threads into warps. In this paper, we present a novel scheduling and execution technique for tree traversal algorithms that is both general and automatic. The key novelty is a hybrid, inspector-executor approach: the GPU partially executes tasks to inspect thread behavior and transmits information back to the CPU, which uses that information to perform the scheduling itself, before the remaining, carefully scheduled portion of the traversals is executed on the GPU. We applied this framework to six tree traversal algorithms, achieving significant speedups over optimized GPU code that does not perform application-specific scheduling. Further, we show that in many cases our hybrid approach delivers better performance than even GPU code that uses hand-tuned, application-specific scheduling.
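The inspector-executor idea in the abstract can be illustrated with a small CPU-only simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: the implicit binary tree over [0, 1), the path "signature", and the names `inspect`, `schedule`, and `distinct_paths` are all hypothetical, and the count of distinct paths per warp is only a rough proxy for SIMT divergence. The inspector descends a few levels per query, the scheduler sorts queries by the path they took, and consecutive groups of 32 form warps.

```python
WARP_SIZE = 32  # threads per warp (assumed; typical for NVIDIA GPUs)


def inspect(query, depth):
    """Inspector phase (hypothetical): descend `depth` levels of an implicit
    binary tree over [0, 1) and encode the path taken as an integer."""
    lo, hi, signature = 0.0, 1.0, 0
    for _ in range(depth):
        mid = (lo + hi) / 2
        go_right = query >= mid
        signature = (signature << 1) | int(go_right)
        lo, hi = (mid, hi) if go_right else (lo, mid)
    return signature


def schedule(queries, depth):
    """Scheduling phase (done on the CPU in the hybrid scheme): sort query
    indices by inspected signature, then cut the order into warps."""
    order = sorted(range(len(queries)), key=lambda i: inspect(queries[i], depth))
    return [order[i:i + WARP_SIZE] for i in range(0, len(order), WARP_SIZE)]


def distinct_paths(warps, queries, depth):
    """Divergence proxy: distinct signatures per warp, summed over all warps."""
    return sum(len({inspect(queries[i], depth) for i in warp}) for warp in warps)


# Usage: compare arrival-order warps against inspected-and-sorted warps.
n, depth = 1024, 4
queries = [(i * 0.6180339887) % 1.0 for i in range(n)]  # well-spread inputs
baseline = [list(range(i, i + WARP_SIZE)) for i in range(0, n, WARP_SIZE)]
scheduled = schedule(queries, depth)
print("unscheduled:", distinct_paths(baseline, queries, depth))
print("scheduled:  ", distinct_paths(scheduled, queries, depth))
```

In the sorted schedule, a warp mixes paths only where a run of equal signatures straddles a warp boundary, so the proxy is bounded by the number of distinct signatures plus the number of warps minus one. A real system would run the executor phase on the GPU; here it is omitted.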





Latest papers from this venue

Prefetching Techniques for Near-memory Throughput Processors

Polly-ACC Transparent compilation to heterogeneous hardware

Galaxyfly: A Novel Family of Flexible-Radix Low-Diameter Topologies for Large-Scales Interconnection Networks

Parallel Transposition of Sparse Data Structures

Optimizing Sparse Matrix-Vector Multiplication for Large-Scale Data Analytics