
Proceedings of the 13th Annual Workshop on General Purpose Processing using Graphics Processing Unit: Latest Publications

The Minos Computing Library: efficient parallel programming for extremely heterogeneous systems
R. Gioiosa, B. O. Mutlu, Seyong Lee, J. Vetter, Giulio Picierro, M. Cesati
Hardware specialization has become the silver bullet for achieving efficient high performance, from Systems-on-Chip (SoCs), where hardware specialization can be "extreme", to large-scale HPC systems. As the complexity of these systems increases, so does the complexity of programming such architectures in a portable way. This work introduces the Minos Computing Library (MCL): system software, a programming model, and a programming-model runtime that together facilitate programming extremely heterogeneous systems. MCL supports the execution of several multi-threaded applications within the same compute node, executes application tasks asynchronously, efficiently balances computation across hardware resources, and provides performance portability. We show that code developed on a personal desktop automatically scales up to fully utilize powerful workstations with 8 GPUs and down to power-efficient embedded systems. MCL provides up to 17.5x speedup over OpenCL on NVIDIA DGX-1 systems and up to 1.88x speedup on single-GPU systems. In multi-application workloads, MCL's dynamic resource allocation provides up to 2.43x performance improvement over manual, static resource allocation.
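The task model the abstract describes can be sketched in miniature. The following Python toy is not MCL's actual API (MCL is a C/OpenCL-level library; `Runtime`, `DEVICES`, and `saxpy` are invented names); it only illustrates asynchronous task submission with least-loaded device balancing, the shape of the programming model:

```python
import concurrent.futures

# Hypothetical device pool standing in for the runtime's view of one node:
# (device name, relative throughput). All names here are invented.
DEVICES = [("gpu0", 4.0), ("gpu1", 4.0), ("cpu", 1.0)]

class Runtime:
    """Toy runtime in the spirit of the abstract: applications submit tasks
    asynchronously; the runtime dispatches each task to the least-loaded
    device, so the same program adapts to whatever devices are present."""

    def __init__(self, devices):
        self.speed = dict(devices)
        self.load = {name: 0.0 for name in self.speed}  # pending work per device
        self.pool = concurrent.futures.ThreadPoolExecutor(len(devices))

    def submit(self, kernel, *args):
        # Dynamic balancing: pick the device with the least pending work,
        # weighting each task by the device's throughput.
        dev = min(self.load, key=self.load.get)
        self.load[dev] += 1.0 / self.speed[dev]
        return self.pool.submit(self._run, dev, kernel, *args)

    def _run(self, dev, kernel, *args):
        try:
            return kernel(dev, *args)  # a real runtime would launch on `dev`
        finally:
            self.load[dev] -= 1.0 / self.speed[dev]

def saxpy(dev, a, x, y):
    # Stand-in "kernel": element-wise a*x + y, tagged with its device.
    return dev, [a * xi + yi for xi, yi in zip(x, y)]

rt = Runtime(DEVICES)
futures = [rt.submit(saxpy, 2.0, [1.0, 2.0], [3.0, 4.0]) for _ in range(6)]
for f in futures:
    print(f.result())  # tasks complete asynchronously on gpu0/gpu1/cpu
```

A real scheduler would weigh more than pending work (data placement, device capabilities), but the asynchronous submit-and-balance shape is the point here.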
{"title":"The Minos Computing Library: efficient parallel programming for extremely heterogeneous systems","authors":"R. Gioiosa, B. O. Mutlu, Seyong Lee, J. Vetter, Giulio Picierro, M. Cesati","doi":"10.1145/3366428.3380770","DOIUrl":"https://doi.org/10.1145/3366428.3380770","url":null,"abstract":"Hardware specialization has become the silver bullet to achieve efficient high performance, from Systems-on-Chip systems, where hardware specialization can be \"extreme\", to large-scale HPC systems. As the complexity of the systems increases, so does the complexity of programming such architectures in a portable way. This work introduces the Minos Computing Library (MCL), as system software, programming model, and programming model runtime that facilitate programming extremely heterogeneous systems. MCL supports the execution of several multi-threaded applications within the same compute node, performs asynchronous execution of application tasks, efficiently balances computation across hardware resources, and provides performance portability. We show that code developed on a personal desktop automatically scales up to fully utilize powerful workstations with 8 GPUs and down to power-efficient embedded systems. MCL provides up to 17.5x speedup over OpenCL on NVIDIA DGX-1 systems and up to 1.88x speedup on single-GPU systems. In multi-application workloads, MCL's dynamic resource allocation provides up to 2.43x performance improvement over manual, static resources allocation.","PeriodicalId":266831,"journal":{"name":"Proceedings of the 13th Annual Workshop on General Purpose Processing using Graphics Processing Unit","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-02-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123587454","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 9
GPGPU performance estimation for frequency scaling using cross-benchmarking
Qiang Wang, Chengjian Liu, X. Chu
Dynamic Voltage and Frequency Scaling (DVFS) on General-Purpose Graphics Processing Units (GPGPUs) is becoming one of the most significant techniques for balancing computational performance and energy consumption. However, there are still few fast and accurate models for predicting GPU kernel execution time under different core and memory frequency settings, which is important for determining the best frequency configuration for energy saving. Accordingly, we propose a novel GPGPU performance estimation model covering both core and memory frequency scaling. We design a cross-benchmarking suite that simulates kernels with a wide range of instruction distributions. The synthetic kernels generated by this suite can be used for model pre-training or as supplementary training samples. We then apply two different machine learning algorithms, Support Vector Regression (SVR) and Gradient Boosting Decision Tree (GBDT), to study the correlation between kernel performance counters and kernel performance. The models trained only with our cross-benchmarking suite achieve satisfactory accuracy (16%~22% mean absolute error) on 24 unseen real application kernels. Validated on three modern GPUs with a wide frequency scaling range, using a collection of 24 real application kernels, the proposed model achieves accurate results (5.1%, 2.8%, and 6.5% mean absolute error) for the target GPUs (GTX 980, Titan X Pascal, and Tesla P100).
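As a rough illustration of the modeling step, the sketch below trains the same two regressor families, SVR and GBDT, to predict kernel time from counters plus core/memory frequencies. The data is synthetic: the counter set and the timing formula are invented for this sketch, whereas the paper profiles real counters on real kernels.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins for profiled features. Columns: compute instructions,
# memory transactions, core frequency (MHz), memory frequency (MHz).
X = np.column_stack([
    rng.uniform(1e6, 1e8, n),
    rng.uniform(1e5, 1e7, n),
    rng.uniform(500, 1500, n),
    rng.uniform(2000, 5000, n),
])
# Invented ground truth: kernels mix a compute-bound phase (scales with core
# frequency) and a memory-bound phase (scales with memory frequency).
t = (X[:, 0] / X[:, 2] + X[:, 1] / X[:, 3]) * rng.normal(1.0, 0.05, n)

train, test = slice(0, 400), slice(400, None)
models = {
    # SVR needs scaled inputs and targets to behave; GBDT does not.
    "SVR": TransformedTargetRegressor(
        make_pipeline(StandardScaler(), SVR(C=10.0)),
        transformer=StandardScaler()),
    "GBDT": GradientBoostingRegressor(random_state=0),
}
for name, model in models.items():
    model.fit(X[train], t[train])
    err = mean_absolute_percentage_error(t[test], model.predict(X[test]))
    print(f"{name}: {err:.1%} mean absolute percentage error")
```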
{"title":"GPGPU performance estimation for frequency scaling using cross-benchmarking","authors":"Qiang Wang, Chengjian Liu, X. Chu","doi":"10.1145/3366428.3380767","DOIUrl":"https://doi.org/10.1145/3366428.3380767","url":null,"abstract":"Dynamic Voltage and Frequency Scaling (D VFS) on General-Purpose Graphics Processing Units (GPGPUs) is now becoming one of the most significant techniques to balance computational performance and energy consumption. However, there are still few fast and accurate models for predicting GPU kernel execution time under different core and memory frequency settings, which is important to determine the best frequency configuration for energy saving. Accordingly, a novel GPGPU performance estimation model with both core and memory frequency scaling is herein proposed. We design a cross-benchmarking suite, which simulates kernels with a wide range of instruction distributions. The synthetic kernels generated by this suite can be used for model pre-training or as supplementary training samples. Then we apply two different machine learning algorithms, Support Vector Regression (SVR) and Gradient Boosting Decision Tree (GBDT), to study the correlation between kernel performance counters and kernel performance. The models trained only with our cross-benchmarking suite achieve satisfying accuracy (16%~22% mean absolute error) on 24 unseen real application kernels. Validated on three modern GPUs with a wide frequency scaling range, by using a collection of 24 real application kernels, the proposed model is able to achieve accurate results (5.1%, 2.8%, 6.5% mean absolute error) for the target GPUs (GTX 980, Titan X Pascal and Tesla P100).","PeriodicalId":266831,"journal":{"name":"Proceedings of the 13th Annual Workshop on General Purpose Processing using Graphics Processing Unit","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-02-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124780926","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1
Automatic generation of specialized direct convolutions for mobile GPUs
Naums Mogers, Valentin Radu, Lu Li, Jack Turner, M. O’Boyle, Christophe Dubach
Convolutional Neural Networks (CNNs) are a powerful and versatile tool for performing computer vision tasks in both resource-constrained settings and server-side applications. Most GPU hardware vendors provide highly tuned libraries for CNNs, such as Nvidia's cuDNN or the ARM Compute Library. Such libraries are the basis for higher-level, commonly used machine-learning frameworks such as PyTorch or Caffe, abstracting them away from vendor-specific implementation details. However, writing optimized parallel code for GPUs is far from trivial. This places a significant burden on hardware-specific library writers, who have to continually play catch-up with rapid hardware and network evolution. To reduce effort and time to market, new approaches are needed based on automatic code generation rather than manual implementation. This paper describes such an approach for direct convolutions using Lift, a new data-parallel intermediate language and compiler. Lift uses a high-level intermediate language to express algorithms, which are then automatically optimized using a system of rewrite rules. Direct convolution, as opposed to the matrix-multiplication approach commonly used by machine-learning frameworks, uses an order of magnitude less memory, which is critical for mobile devices. Using Lift, we show that it is possible to automatically generate code that is 10x faster than the direct convolution of the very specialized ARM Compute Library on the latest generation of ARM Mali GPUs, while using 3.6x less space than its GEMM-based convolution.
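The memory claim is easy to make concrete. A minimal NumPy sketch (single channel, single 3x3 filter; illustrative Python, not the Lift-generated OpenCL) contrasts direct convolution, which needs no intermediate buffer beyond the output, with the im2col+GEMM scheme whose `cols` buffer grows by a factor of K*K:

```python
import numpy as np

def conv2d_direct(x, w):
    """Direct convolution (valid padding): no intermediate buffers beyond
    the output, which is what makes it attractive on memory-poor mobile GPUs."""
    H, W = x.shape
    K = w.shape[0]
    out = np.zeros((H - K + 1, W - K + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i+K, j:j+K] * w)
    return out

def conv2d_gemm(x, w):
    """im2col + GEMM, as in typical ML frameworks: fast via matrix multiply,
    but the `cols` buffer replicates every K*K input window."""
    H, W = x.shape
    K = w.shape[0]
    oh, ow = H - K + 1, W - K + 1
    cols = np.empty((K * K, oh * ow))  # the extra memory
    for i in range(oh):
        for j in range(ow):
            cols[:, i * ow + j] = x[i:i+K, j:j+K].ravel()
    return (w.ravel() @ cols).reshape(oh, ow)

x = np.random.rand(64, 64)
w = np.random.rand(3, 3)
assert np.allclose(conv2d_direct(x, w), conv2d_gemm(x, w))
print("im2col buffer elements:", 9 * 62 * 62, "vs input:", 64 * 64)  # ~8.4x
```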
{"title":"Automatic generation of specialized direct convolutions for mobile GPUs","authors":"Naums Mogers, Valentin Radu, Lu Li, Jack Turner, M. O’Boyle, Christophe Dubach","doi":"10.1145/3366428.3380771","DOIUrl":"https://doi.org/10.1145/3366428.3380771","url":null,"abstract":"Convolutional Neural Networks (CNNs) are a powerful and versatile tool for performing computer vision tasks in both resource constrained settings and server-side applications. Most GPU hardware vendors provide highly tuned libraries for CNNs such as Nvidia's cuDNN or ARM Compute Library. Such libraries are the basis for higher-level, commonly-used, machine-learning frameworks such as PyTorch or Caffe, abstracting them away from vendor-specific implementation details. However, writing optimized parallel code for GPUs is far from trivial. This places a significant burden on hardware-specific library writers which have to continually play catch-up with rapid hardware and network evolution. To reduce effort and reduce time to market, new approaches are needed based on automatic code generation, rather than manual implementation. This paper describes such an approach for direct convolutions using Lift, a new data-parallel intermediate language and compiler. Lift uses a high-level intermediate language to express algorithms which are then automatically optimized using a system of rewrite-rules. Direct convolution, as opposed to the matrix multiplication approach used commonly by machine-learning frameworks, uses an order of magnitude less memory, which is critical for mobile devices. Using Lift, we show that it is possible to generate automatically code that is X10 faster than the direct convolution while using X3.6 less space than the GEMM-based convolution of the very specialized ARM Compute Library on the latest generation of ARM Mali GPU.","PeriodicalId":266831,"journal":{"name":"Proceedings of the 13th Annual Workshop on General Purpose Processing using Graphics Processing Unit","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-02-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129208430","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 12
High-level hardware feature extraction for GPU performance prediction of stencils
Toomas Remmelg, Bastian Hagedorn, Lu Li, Michel Steuwer, S. Gorlatch, Christophe Dubach
High-level functional programming abstractions have started to show promising results for HPC (High-Performance Computing). Approaches such as Lift, Futhark, or Delite have shown that it is possible to have both high-level abstractions and performance, even for HPC workloads such as stencils. In addition, these high-level functional abstractions can also be used to represent programs and their optimized variants within the compiler itself. However, such high-level approaches rely heavily on the compiler to optimize programs, which is notoriously hard when targeting GPUs. Compilers either use hand-crafted heuristics to direct the optimizations or iterative compilation to search the optimization space. The first approach offers fast compile times; however, it is not performance-portable across different devices and requires a lot of human effort to build the heuristics. Iterative compilation, on the other hand, can search the optimization space automatically and adapt to different devices. However, this process is often very time-consuming, as thousands of variants have to be evaluated. Performance models based on statistical techniques have been proposed to speed up optimization-space exploration. However, they rely on low-level hardware features, in the form of performance counters or low-level static code features. Using the Lift framework, this paper demonstrates how low-level, GPU-specific features can be extracted directly from a high-level functional representation. The Lift IR (Intermediate Representation) is in fact a very suitable choice, since all optimization choices are exposed at the IR level. This paper shows how to extract low-level features such as the number of unique cache lines accessed per warp, which is crucial for building accurate GPU performance models. Using this approach, we are able to speed up the exploration of the space by a factor of 2000x on an AMD GPU and 450x on an Nvidia GPU, on average, across many stencil applications.
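The "unique cache lines per warp" feature itself is simple to state. The sketch below computes it dynamically from explicit per-thread byte addresses (the paper derives it statically from the Lift IR; the 128-byte line size is an assumption typical of Nvidia L1 cache lines):

```python
LINE_BYTES = 128  # cache-line size assumed for illustration
WARP = 32

def unique_lines_per_warp(addresses, warp_size=WARP, line=LINE_BYTES):
    """Given per-thread byte addresses for one memory instruction, count the
    unique cache lines each warp touches: 1 means perfectly coalesced,
    warp_size means fully scattered. The paper extracts this kind of feature
    statically from the IR; here we just compute it from concrete addresses."""
    counts = []
    for w in range(0, len(addresses), warp_size):
        warp_addrs = addresses[w:w + warp_size]
        counts.append(len({a // line for a in warp_addrs}))
    return counts

# Coalesced: consecutive threads read consecutive 4-byte floats.
coalesced = [4 * t for t in range(64)]
# Strided: each thread reads 4 bytes, 256 bytes apart -> one line per thread.
strided = [256 * t for t in range(64)]
print(unique_lines_per_warp(coalesced))  # [1, 1]
print(unique_lines_per_warp(strided))    # [32, 32]
```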
{"title":"High-level hardware feature extraction for GPU performance prediction of stencils","authors":"Toomas Remmelg, Bastian Hagedorn, Lu Li, Michel Steuwer, S. Gorlatch, Christophe Dubach","doi":"10.1145/3366428.3380769","DOIUrl":"https://doi.org/10.1145/3366428.3380769","url":null,"abstract":"High-level functional programming abstractions have started to show promising results for HPC (High-Performance Computing). Approaches such as Lift, Futhark or Delite have shown that it is possible to have both, high-level abstractions and performance, even for HPC workloads such as stencils. In addition, these high-level functional abstractions can also be used to represent programs and their optimized variants, within the compiler itself. However, such high-level approaches rely heavily on the compiler to optimize programs which is notoriously hard when targeting GPUs. Compilers either use hand-crafted heuristics to direct the optimizations or iterative compilation to search the optimization space. The first approach has fast compile times, however, it is not performance-portable across different devices and requires a lot of human effort to build the heuristics. Iterative compilation, on the other hand, has the ability to search the optimization space automatically and adapts to different devices. However, this process is often very time-consuming as thousands of variants have to be evaluated. Performance models based on statistical techniques have been proposed to speedup the optimization space exploration. However, they rely on low-level hardware features, in the form of performance counters or low-level static code features. Using the Lift framework, this paper demonstrates how low-level, GPU-specific features are extractable directly from a high-level functional representation. The Lift IR (Intermediate Representation) is in fact a very suitable choice since all optimization choices are exposed at the IR level. This paper shows how to extract low-level features such as number of unique cache lines accessed per warp, which is crucial for building accurate GPU performance models. Using this approach, we are able to speedup the exploration of the space by a factor 2000x on an AMD GPU and 450x on Nvidia on average across many stencil applications.","PeriodicalId":266831,"journal":{"name":"Proceedings of the 13th Annual Workshop on General Purpose Processing using Graphics Processing Unit","volume":"149 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-02-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133458142","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 2
Automated test generation for OpenCL kernels using fuzzing and constraint solving
Chao Peng, A. Rajan
Graphics Processing Units (GPUs) are massively parallel processors offering performance acceleration and energy efficiency unmatched by current CPUs. These advantages, along with recent advances in the programmability of GPUs, have made them attractive for general-purpose computations. Despite the advances in programmability, GPU kernels are hard to code and analyse due to the high complexity of memory sharing patterns, striding patterns for memory accesses, implicit synchronisation, and the combinatorial explosion of thread interleavings. The few existing techniques for testing GPU kernels use symbolic execution for test generation, which incurs high overhead, has limited scalability, and does not handle all data types. We propose a test generation technique for OpenCL kernels that combines mutation-based fuzzing and selective constraint solving, with the goal of being fast, effective, and scalable. Fuzz testing for GPU kernels has not been explored previously. Our fuzz-testing approach randomly mutates input kernel argument values with the goal of increasing branch coverage. When fuzz testing is unable to increase branch coverage with random mutations, we gather path constraints for uncovered branch conditions and invoke the Z3 constraint solver to generate tests for them. In addition to the test generator, we also present a schedule amplifier that simulates multiple work-group schedules, with which to execute each of the generated tests. The schedule amplifier is designed to help uncover inter-work-group data races. We evaluate the effectiveness of the generated tests and the schedule amplifier using 217 kernels from open-source projects and industry-standard benchmark suites, measuring branch coverage and fault finding. We find that our test generation technique achieves close to 100% coverage and mutation score for the majority of the kernels. The overhead incurred in test generation is small (an average of 0.8 seconds). We also confirmed that our technique scales easily to large kernels and can support all OpenCL data types, including complex data structures.
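The two-phase idea, fuzz first and solve only what fuzzing misses, can be shown with a toy scalar "kernel". The `kernel_branches` function and its conditions are invented; the real tool targets OpenCL kernels and gathers path constraints automatically. The solver, however, is the same Z3, here via its Python bindings:

```python
import random
from z3 import Int, Solver, sat  # pip install z3-solver

def kernel_branches(x, y):
    """Stand-in for an OpenCL kernel's control flow: returns which
    branches a given input exercises."""
    taken = set()
    if x > 0:
        taken.add("A")
    if x * 37 + y == 10_000:  # practically unreachable by random mutation
        taken.add("B")
    return taken

# Phase 1: mutation-based fuzzing with random argument values.
covered = set()
for _ in range(10_000):
    covered |= kernel_branches(random.randint(-100, 100),
                               random.randint(-100, 100))

# Phase 2: for branches fuzzing missed, hand the path constraint to Z3.
if "B" not in covered:
    x, y = Int("x"), Int("y")
    s = Solver()
    s.add(x * 37 + y == 10_000)  # the uncovered branch condition
    if s.check() == sat:
        m = s.model()
        tx, ty = m[x].as_long(), m[y].as_long()
        covered |= kernel_branches(tx, ty)
        print(f"Z3-generated input for branch B: x={tx}, y={ty}")
print("covered:", sorted(covered))
```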
{"title":"Automated test generation for OpenCL kernels using fuzzing and constraint solving","authors":"Chao Peng, A. Rajan","doi":"10.1145/3366428.3380768","DOIUrl":"https://doi.org/10.1145/3366428.3380768","url":null,"abstract":"Graphics Processing Units (GPUs) are massively parallel processors offering performance acceleration and energy efficiency unmatched by current processors (CPUs) in computers. These advantages along with recent advances in the programmability of GPUs have made them attractive for general-purpose computations. Despite the advances in programmability, GPU kernels are hard to code and analyse due to the high complexity of memory sharing patterns, striding patterns for memory accesses, implicit synchronisation, and combinatorial explosion of thread interleavings. Existing few techniques for testing GPU kernels use symbolic execution for test generation that incur a high overhead, have limited scalability and do not handle all data types. We propose a test generation technique for OpenCL kernels that combines mutation-based fuzzing and selective constraint solving with the goal of being fast, effective and scalable. Fuzz testing for GPU kernels has not been explored previously. Our approach for fuzz testing randomly mutates input kernel argument values with the goal of increasing branch coverage. When fuzz testing is unable to increase branch coverage with random mutations, we gather path constraints for uncovered branch conditions and invoke the Z3 constraint solver to generate tests for them. In addition to the test generator, we also present a schedule amplifier that simulates multiple work-group schedules, with which to execute each of the generated tests. The schedule amplifier is designed to help uncover inter work-group data races. We evaluate the effectiveness of the generated tests and schedule amplifier using 217 kernels from open source projects and industry standard benchmark suites measuring branch coverage and fault finding. We find our test generation technique achieves close to 100% coverage and mutation score for majority of the kernels. Overhead incurred in test generation is small (average of 0.8 seconds). We also confirmed our technique scales easily to large kernels, and can support all OpenCL data types, including complex data structures.","PeriodicalId":266831,"journal":{"name":"Proceedings of the 13th Annual Workshop on General Purpose Processing using Graphics Processing Unit","volume":"248 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-02-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133898783","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 6
Unveiling kernel concurrency in multiresolution filters on GPUs with an image processing DSL
Bo Qiao, Oliver Reiche, J. Teich, Frank Hannig
Multiresolution filters, which analyze information at different scales, are crucial for many applications in digital image processing. The different space and time complexity at distinct scales in the unique pyramidal structure poses a challenge, as well as an opportunity, for implementations on modern accelerators such as GPUs with an increasing number of compute units. In this paper, we exploit the potential of concurrent kernel execution in multiresolution filters. As a major contribution, we present a model-based approach for performance analysis of both single- and multi-stream implementations, combining application- and architecture-specific knowledge. As a second contribution, the involved transformations and code generators using CUDA streams on Nvidia GPUs have been integrated into a compiler-based approach using an image processing DSL called Hipacc. We then apply our approach to evaluate and compare the achieved performance for four real-world applications on three GPUs. The results show that our method can achieve a geometric mean speedup of up to 2.5x over the original Hipacc implementation without our approach, up to 2.0x over Halide, another state-of-the-art DSL, and up to 1.3x over Nvidia's recently released CUDA Graph programming model.
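A toy version of such a performance model conveys where the speedup comes from: coarse pyramid levels launch tiny grids that cannot saturate the GPU, so independent kernels in separate streams can overlap. The durations, occupancy fractions, and two identical pyramid branches below are invented for illustration and do not come from the paper:

```python
# Each kernel is (name, duration_ms, gpu_fraction), where gpu_fraction is the
# share of the GPU's compute units the kernel can saturate. Invented numbers:
# each coarser pyramid level halves both work and achievable occupancy.
PYRAMID = [(f"level{l}", 4.0 / 2**l, 0.5 / 2**l) for l in range(5)]

def single_stream(kernels):
    # One CUDA stream: kernels serialize regardless of occupancy.
    return sum(d for _, d, _ in kernels)

def multi_stream(branches):
    """Greedy makespan estimate for independent branches in separate streams:
    admit head kernels while their fractions fit into capacity 1.0, then
    advance time by the shortest running kernel. Assumes all fractions <= 1."""
    heads = [list(b) for b in branches]
    t = 0.0
    while any(heads):
        running, cap = [], 1.0
        for b in heads:  # admit head kernels that still fit
            if b and b[0][2] <= cap:
                running.append(b)
                cap -= b[0][2]
        step = min(b[0][1] for b in running)  # shortest running kernel
        t += step
        for b in running:  # progress admitted kernels
            name, d, f = b[0]
            if d - step <= 1e-9:
                b.pop(0)
            else:
                b[0] = (name, d - step, f)
    return t

serial = single_stream(PYRAMID) * 2          # two filter branches, one stream
overlap = multi_stream([PYRAMID, PYRAMID])   # same work in two streams
print(f"one stream : {serial:.2f} ms")
print(f"two streams: {overlap:.2f} ms ({serial / overlap:.2f}x)")
```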
{"title":"Unveiling kernel concurrency in multiresolution filters on GPUs with an image processing DSL","authors":"Bo Qiao, Oliver Reiche, J. Teich, Frank Hannig","doi":"10.1145/3366428.3380773","DOIUrl":"https://doi.org/10.1145/3366428.3380773","url":null,"abstract":"Multiresolution filters, analyzing information at different scales, are crucial for many applications in digital image processing. The different space and time complexity at distinct scales in the unique pyramidal structure poses a challenge as well as an opportunity to implementations on modern accelerators such as GPUs with an increasing number of compute units. In this paper, we exploit the potential of concurrent kernel execution in multiresolution filters. As a major contribution, we present a model-based approach for performance analysis of as well single- as multi-stream implementations, combining both application- and architecture-specific knowledge. As a second contribution, the involved transformations and code generators using CUDA streams on Nvidia GPUs have been integrated into a compiler-based approach using an image processing DSL called Hipacc. We then apply our approach to evaluate and compare the achieved performance for four real-world applications on three GPUs. The results show that our method can achieve a geometric mean speedup of up to 2.5 over the original Hipacc implementation without our approach, up to 2.0 over the other state-of-the-art DSL Halide, and up to 1.3 over the recently released programming model CUDA Graph from Nvidia.","PeriodicalId":266831,"journal":{"name":"Proceedings of the 13th Annual Workshop on General Purpose Processing using Graphics Processing Unit","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-02-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126857274","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 5
Custom code generation for a graph DSL
Bikash Gogoi, Unnikrishnan Cheramangalath, R. Nasre
We present the challenges faced in making a domain-specific language (DSL) for graph algorithms adapt to the varying requirements of generating a spectrum of efficient parallel codes. Graph algorithms are at the heart of several applications, and achieving high performance on graph applications has become critical due to the tremendous growth of irregular data. However, irregular algorithms are quite challenging to auto-parallelize, due to access patterns influenced by the input graph, which is unavailable until execution. Prior research has addressed this issue by designing DSLs for graph algorithms, which restrict generality but allow efficient code generation for various backends. Such DSLs are, however, too rigid and do not adapt to changes. For instance, these DSLs are incapable of changing the way of processing if the underlying graph changes. As another instance, most of the DSLs do not support more than one backend. We narrate our experiences in making an existing DSL, named Falcon, adaptive. The biggest challenge in the process is to retain the DSL code for specifying the underlying algorithm while still generating different backend codes. We illustrate the effectiveness of our proposal by auto-generating codes for vertex-based versus edge-based graph processing, synchronous versus asynchronous execution, and CPU versus GPU backends from the same specification.
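The core idea, one algorithm specification and multiple generated schedules, can be caricatured in a few lines. The spec format and templates below are invented for illustration (Falcon's real DSL and its generated CUDA/OpenMP code are far richer); they just show the same edge operator being spliced into a vertex-based or an edge-based iteration skeleton:

```python
# Toy code generator: one abstract specification, two generated schedules.
SPEC = {"name": "sssp_relax",
        "edge_op": "if dist[u] + w < dist[v]: dist[v] = dist[u] + w"}

VERTEX_TMPL = """def {name}_vertex(graph, dist):
    for u in graph.vertices():          # parallel-for over vertices
        for v, w in graph.neighbors(u): # each thread owns a vertex
            {edge_op}
"""

EDGE_TMPL = """def {name}_edge(graph, dist):
    for u, v, w in graph.edges():       # parallel-for over edges
        {edge_op}
"""

def generate(spec, layout):
    # The same edge operator lands in whichever iteration schedule suits the
    # input graph (e.g. edge-based for graphs with skewed degree distributions).
    tmpl = VERTEX_TMPL if layout == "vertex" else EDGE_TMPL
    return tmpl.format(**spec)

print(generate(SPEC, "vertex"))
print(generate(SPEC, "edge"))
```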
{"title":"Custom code generation for a graph DSL","authors":"Bikash Gogoi, Unnikrishnan Cheramangalath, R. Nasre","doi":"10.1145/3366428.3380772","DOIUrl":"https://doi.org/10.1145/3366428.3380772","url":null,"abstract":"We present challenges faced in making a domain-specific language (DSL) for graph algorithms adapt to varying requirements of generating a spectrum of efficient parallel codes. Graph algorithms are at the heart of several applications, and achieving high performance on graph applications has become critical due to the tremendous growth of irregular data. However, irregular algorithms are quite challenging to auto-parallelize, due to access patterns influenced by the input graph - which is unavailable until execution. Former research has addressed this issue by designing DSLs for graph algorithms, which restrict generality but allow efficient codegeneration for various backends. Such DSLs are, however, too rigid, and do not adapt to changes. For instance, these DSLs are incapable of changing the way of processing if the underlying graph changes. As another instance, most of the DSLs do not support more than one backends. We narrate our experiences in making an existing DSL, named Falcon, adaptive. The biggest challenge in the process is to retain the DSL code for specifying the underlying algorithm, and still generate different backend codes. We illustrate the effectiveness of our proposal by auto-generating codes for vertex-based versus edge-based graph processing, synchronous versus asynchronous execution, and CPU versus GPU backends from the same specification.","PeriodicalId":266831,"journal":{"name":"Proceedings of the 13th Annual Workshop on General Purpose Processing using Graphics Processing Unit","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-03-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122286335","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 2