Automated Partitioning of Data-Parallel Kernels using Polyhedral Compilation

Workshop Proceedings of the 49th International Conference on Parallel Processing. Pub Date: 2020-08-17. DOI: 10.1145/3409390.3409403

Alexander Matz, J. Doerfert, H. Fröning


Citations: 3

Abstract

GPUs are well-established in domains outside of computer graphics, including scientific computing, artificial intelligence, data warehousing, and other computationally intensive areas. Their execution model is based on a thread hierarchy and suggests that GPU workloads can generally be safely partitioned along the boundaries of thread blocks. However, the most efficient partitioning strategy depends heavily on the application’s memory access patterns, and choosing and implementing it is usually a tedious task for programmers. We leverage this observation for a concept that automatically compiles single-GPU code to multi-GPU applications. We present the idea and a prototype implementation of this concept and validate both on a selection of benchmarks. In particular, we illustrate our use of 1) polyhedral compilation to model memory accesses, 2) a runtime library to track GPU buffers and identify stale data, 3) IR transformations for the partitioning of GPU kernels, and 4) a custom preprocessor that rewrites CUDA host code to utilize multiple GPUs. This work focuses on applications with regular access patterns on global memory and on a toolchain that compiles CUDA applications fully automatically, without any user intervention. Our benchmarks compare single-device CUDA binaries produced by NVIDIA’s reference compiler to binaries produced for multiple GPUs using our toolchain. We report speedups of up to 12.4x for 16 Kepler-class GPUs.
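The partitioning idea in the abstract can be illustrated with a small sketch. This is not the authors' toolchain: the names `Partition`, `partition_blocks`, and `touched_range` are hypothetical, the grid is assumed to be one-dimensional, and a simple interval computation stands in for the polyhedral model of memory accesses. It assumes a kernel whose thread with global index `i = blockIdx.x * blockDim.x + threadIdx.x` accesses elements `a[i + off]` for a fixed set of offsets, with out-of-bounds accesses guarded by the kernel.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    device: int       # hypothetical device index
    first_block: int  # first thread block of the partition (inclusive)
    last_block: int   # end of the partition (exclusive)


def partition_blocks(num_blocks: int, num_devices: int) -> list[Partition]:
    """Split a 1D grid into contiguous, near-equal ranges of thread blocks."""
    base, extra = divmod(num_blocks, num_devices)
    parts, start = [], 0
    for dev in range(num_devices):
        size = base + (1 if dev < extra else 0)
        parts.append(Partition(dev, start, start + size))
        start += size
    return parts


def touched_range(part: Partition, block_dim: int,
                  offsets: tuple[int, ...], n: int) -> tuple[int, int]:
    """Half-open range of elements touched by accesses a[i + off].

    i runs over all global thread indices of the partition; the result is
    clamped to [0, n), assuming the kernel guards out-of-bounds accesses.
    """
    lo_i = part.first_block * block_dim
    hi_i = part.last_block * block_dim  # exclusive
    if lo_i >= hi_i:
        return (0, 0)  # empty partition touches nothing
    lo = max(0, lo_i + min(offsets))
    hi = min(n, hi_i + max(offsets))
    return (lo, hi)


if __name__ == "__main__":
    n, block_dim = 1024, 128
    parts = partition_blocks(num_blocks=n // block_dim, num_devices=4)
    for p in parts:
        reads = touched_range(p, block_dim, (-1, 0, 1), n)  # 3-point stencil input
        writes = touched_range(p, block_dim, (0,), n)       # one output per thread
        print(p.device, "reads", reads, "writes", writes)
```

In this sketch, the read range of each partition overlaps its neighbours' write ranges by one element on each side. Those overlapping elements are the kind of data a runtime would have to recognize as stale and refresh between kernel launches; how the actual runtime library tracks this is not described in the abstract.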

