Unveiling kernel concurrency in multiresolution filters on GPUs with an image processing DSL

Proceedings of the 13th Annual Workshop on General Purpose Processing using Graphics Processing Unit. Pub Date: 2020-02-19. DOI: 10.1145/3366428.3380773

Bo Qiao, Oliver Reiche, J. Teich, Frank Hannig


Citations: 5

Abstract

Multiresolution filters, which analyze information at different scales, are crucial for many applications in digital image processing. The differing space and time complexity at distinct scales of the pyramidal structure poses both a challenge and an opportunity for implementations on modern accelerators such as GPUs, which have an increasing number of compute units. In this paper, we exploit the potential of concurrent kernel execution in multiresolution filters. As a major contribution, we present a model-based approach for the performance analysis of both single- and multi-stream implementations, combining application- and architecture-specific knowledge. As a second contribution, the required transformations and code generators using CUDA streams on Nvidia GPUs have been integrated into a compiler-based approach using an image processing DSL called Hipacc. We then apply our approach to evaluate and compare the performance achieved for four real-world applications on three GPUs. The results show that our method achieves a geometric mean speedup of up to 2.5 over the original Hipacc implementation, up to 2.0 over the state-of-the-art DSL Halide, and up to 1.3 over CUDA Graph, a recently released programming model from Nvidia.
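To make two ideas from the abstract concrete, the following is a minimal Python sketch and not the authors' performance model. It assumes a pyramid in which each level halves both image dimensions (a common convention, not stated in the abstract), and it shows how quickly the per-level workload shrinks, which is why coarse levels alone may leave compute units idle and concurrent kernels become attractive. It also shows how a geometric mean speedup is computed. The function names, the 2048x2048 image size, and the speedup values are hypothetical and chosen only for illustration; none of them come from the paper.

```python
import math


def pyramid_levels(width, height, levels):
    """Return the (width, height) of each pyramid level.

    Assumption: every level halves both dimensions of the previous one.
    """
    sizes = []
    w, h = width, height
    for _ in range(levels):
        sizes.append((w, h))
        w, h = max(1, w // 2), max(1, h // 2)
    return sizes


def geometric_mean(values):
    """Geometric mean of positive values, computed via logarithms."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("values must be non-empty and positive")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Workload share per level for a hypothetical 2048x2048 input.
sizes = pyramid_levels(2048, 2048, 5)
pixels = [w * h for w, h in sizes]
total = sum(pixels)
for level, ((w, h), p) in enumerate(zip(sizes, pixels)):
    print(f"level {level}: {w}x{h}, {p / total:.1%} of all pixels")

# Hypothetical per-application speedups, NOT results from the paper.
speedups = [1.2, 2.0, 3.0, 2.5]
print(f"geometric mean speedup: {geometric_mean(speedups):.2f}")
```

Under the halving assumption, each level holds a quarter of the pixels of the level before it, so the finest level dominates the total work while the coarser levels launch comparatively small kernels.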


