An Efficient Vectorization Approach to Nested Thread-level Parallelism for CUDA GPUs

2015 International Conference on Parallel Architecture and Compilation (PACT). Pub Date: 2015-10-18. DOI: 10.1109/PACT.2015.56

Shixiong Xu, David Gregg

Citations: 0

Abstract

Nested thread-level parallelism (TLP) is pervasive in real applications. For example, roughly three quarters (14 out of 19) of the applications in the Rodinia benchmark suite for heterogeneous accelerators contain kernels with nested thread-level parallelism. Efficiently mapping this nested parallelism to GPU threads in C-to-CUDA compilation (OpenACC in this paper) is becoming increasingly important. The mapping problem is twofold: choosing a suitable execution model, and finding an efficient mapping strategy for the nested parallelism.
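To make the notion of nested thread-level parallelism concrete, the sketch below shows a sparse matrix-vector product in CSR form annotated with OpenACC directives. It is an illustration only and is not taken from the paper: the kernel choice, the function name spmv_csr, its parameters, and the particular gang/vector assignment are all assumptions. The outer loop over rows is one level of parallelism; the inner loop over the nonzeros of each row is a second level nested inside it, and its trip count varies from row to row, which is what makes the mapping to GPU threads non-trivial.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example (not from the paper): y = A * x with A stored in CSR.
 * Outer loop  -> one iteration per row (coarse-grained parallelism).
 * Inner loop  -> one iteration per nonzero of that row (nested parallelism).
 * A compiler without OpenACC support ignores the pragmas and runs serially. */
void spmv_csr(int n_rows, int n_cols,
              const int *row_ptr, const int *col_idx,
              const double *vals, const double *x, double *y)
{
    assert(n_rows >= 0 && n_cols >= 0 && row_ptr != NULL && y != NULL);
    int nnz = row_ptr[n_rows];   /* total number of stored nonzeros */

    /* Assumed mapping: rows are distributed across gangs. */
    #pragma acc parallel loop gang \
        copyin(row_ptr[0:n_rows + 1], col_idx[0:nnz], vals[0:nnz], x[0:n_cols]) \
        copyout(y[0:n_rows])
    for (int i = 0; i < n_rows; ++i) {
        double sum = 0.0;

        /* Assumed mapping: the nonzeros of row i are spread across vector
         * lanes, with a reduction combining the partial sums. */
        #pragma acc loop vector reduction(+:sum)
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k) {
            sum += vals[k] * x[col_idx[k]];
        }
        y[i] = sum;
    }
}
```

Other assignments are equally legal here, for example running the inner loop sequentially inside each thread. Choosing among such alternatives is the kind of mapping decision the abstract refers to.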


