Adaptable Two-Dimension Sliding Windows on NVIDIA GPUs with Runtime Compilation

2011 Symposium on Application Accelerators in High-Performance Computing. Pub Date: 2011-07-19. DOI: 10.1109/SAAHPC.2011.11

Nicholas Moore, M. Leeser, L. King

Citations: 4

Abstract

For some classes of problems, the NVIDIA CUDA abstraction and hardware properties combine with problem characteristics to limit the specific problem instances that can be effectively accelerated. As a real-world example, a two-dimensional correlation-based template-matching MATLAB application is considered. This problem has a well-known solution for the common case of linear image filtering, in which small fixed templates of a known size are applied to a much larger image. The application considered here, by contrast, uses large, arbitrarily sized templates of up to 156-by-116 pixels, with small search spaces containing no more than 703 window positions per template. Our CUDA implementation employs template tiling and problem-specific kernel compilation to achieve speedups of up to 15x over an optimized multi-threaded implementation running on a 3.33 GHz four-core Intel Nehalem processor. Tiling the template makes it possible to exploit the parallelism within the computation and to use shared memory. At the same time, problem-specific kernel compilation allows greater levels of adaptability than would otherwise be possible.
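
The two ideas named in the abstract, template tiling and problem-specific compilation, can be illustrated with a small CPU-side sketch. This is not the authors' CUDA code: it is a plain-Python analogue written under several assumptions. The score is an unnormalized sum of products (the abstract says only "correlation-based"), the names `make_matcher` and `reference_match` are hypothetical, and Python's built-in `compile`/`exec` stand in for compiling a GPU kernel at run time. The template and tile dimensions are written into the generated source as literals, which is the part that mirrors compiling a kernel for one specific problem instance.

```python
def make_matcher(tpl_h, tpl_w, tile_h, tile_w):
    """Generate and compile a matcher specialized for one template size.

    The four dimensions are baked into the generated source as literals,
    a CPU stand-in for building a problem-specific kernel at run time.
    """
    src = f"""
def match(image, template, positions):
    scores = []
    for (r0, c0) in positions:
        total = 0.0
        # Walk the template tile by tile; edge tiles are clipped.
        for tr in range(0, {tpl_h}, {tile_h}):
            for tc in range(0, {tpl_w}, {tile_w}):
                partial = 0.0
                for r in range(tr, min(tr + {tile_h}, {tpl_h})):
                    for c in range(tc, min(tc + {tile_w}, {tpl_w})):
                        partial += image[r0 + r][c0 + c] * template[r][c]
                total += partial  # combine per-tile partial sums
        scores.append(total)
    return scores
"""
    namespace = {}
    exec(compile(src, "<generated matcher>", "exec"), namespace)
    return namespace["match"]


def reference_match(image, template, positions):
    """Untiled reference: sum of products at each window position."""
    h, w = len(template), len(template[0])
    return [
        sum(image[r0 + r][c0 + c] * template[r][c]
            for r in range(h) for c in range(w))
        for (r0, c0) in positions
    ]


# Small integer-valued example: a 10x12 image and a 4x5 template cut from it.
image = [[(3 * r + 5 * c) % 7 for c in range(12)] for r in range(10)]
template = [row[2:7] for row in image[1:5]]
# Every window position at which the template fits inside the image.
positions = [(r, c) for r in range(7) for c in range(8)]

# 2x3 tiles do not divide the 4x5 template evenly, so the last tile column
# is clipped to two pixels wide.
match = make_matcher(4, 5, tile_h=2, tile_w=3)
scores = match(image, template, positions)
agrees_with_reference = scores == reference_match(image, template, positions)
```

Splitting the sum into per-tile partial sums does not change the result; it only changes how the work is grouped. On a GPU, each tile would be a natural unit to stage in shared memory and process in parallel. Fixing the sizes at compile time is what would let a compiler unroll the inner loops and size buffers statically, at the cost of one compilation per distinct template size.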
