Adaptable Two-Dimension Sliding Windows on NVIDIA GPUs with Runtime Compilation

2011 Symposium on Application Accelerators in High-Performance Computing Pub Date : 2011-07-19 DOI:10.1109/SAAHPC.2011.11

Nicholas Moore, M. Leeser, L. King

{"title":"Adaptable Two-Dimension Sliding Windows on NVIDIA GPUs with Runtime Compilation","authors":"Nicholas Moore, M. Leeser, L. King","doi":"10.1109/SAAHPC.2011.11","DOIUrl":null,"url":null,"abstract":"For some classes of problems, NVIDIA CUDA abstraction and hardware properties combine with problem characteristics to limit the specific problem instances that can be effectively accelerated. As a real-world example, a two-dimensional correlation-based template-matching MATLAB application is considered. While this problem has a well known solution for the common case of linear image filtering -- small fixed templates of a known size applied to a much larger image -- the application considered here uses large arbitrarily-sized templates, up to 156-by-116 pixels, with small search spaces containing no more than 703 window positions per template. Our CUDA implementation approach employs template tiling and problem-specific kernel compilation to achieve speedups of up to 15 when compared to an optimized multi-threaded implementation running on a 3.33 GHz four core Intel Nehalem processor. Tiling the template enables exploiting the parallelism within the computation and shared memory usage. At the same time, problem-specific kernel compilation allows greater levels of adaptability than would otherwise be possible.","PeriodicalId":331604,"journal":{"name":"2011 Symposium on Application Accelerators in High-Performance Computing","volume":"71 11","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2011-07-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2011 Symposium on Application Accelerators in High-Performance Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SAAHPC.2011.11","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 4

Abstract

For some classes of problems, NVIDIA CUDA abstraction and hardware properties combine with problem characteristics to limit the specific problem instances that can be effectively accelerated. As a real-world example, a two-dimensional correlation-based template-matching MATLAB application is considered. While this problem has a well known solution for the common case of linear image filtering -- small fixed templates of a known size applied to a much larger image -- the application considered here uses large arbitrarily-sized templates, up to 156-by-116 pixels, with small search spaces containing no more than 703 window positions per template. Our CUDA implementation approach employs template tiling and problem-specific kernel compilation to achieve speedups of up to 15 when compared to an optimized multi-threaded implementation running on a 3.33 GHz four core Intel Nehalem processor. Tiling the template enables exploiting the parallelism within the computation and shared memory usage. At the same time, problem-specific kernel compilation allows greater levels of adaptability than would otherwise be possible.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

NVIDIA gpu上的可适应二维滑动窗口与运行时编译

对于某些类别的问题，NVIDIA CUDA抽象和硬件属性与问题特征相结合，以限制可以有效加速的特定问题实例。作为一个实际的例子，考虑了一个基于二维相关的模板匹配MATLAB应用程序。对于线性图像过滤的常见情况，这个问题有一个众所周知的解决方案——将已知大小的小固定模板应用于更大的图像——这里考虑的应用程序使用任意大小的大型模板，最大可达156 × 116像素，每个模板的搜索空间不超过703个窗口位置。与运行在3.33 GHz四核Intel Nehalem处理器上的优化多线程实现相比，我们的CUDA实现方法采用模板平纹和针对问题的内核编译来实现高达15%的速度提升。平铺模板可以利用计算和共享内存使用中的并行性。与此同时，特定于问题的内核编译允许更高级别的适应性。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

2011 Symposium on Application Accelerators in High-Performance Computing

自引率

0.00%

发文量