Program Optimization of Stencil Based Application on the GPU-Accelerated System

2009 IEEE International Symposium on Parallel and Distributed Processing with Applications. Pub Date: 2009-08-18. DOI: 10.1109/ISPA.2009.70

Guibin Wang, Xuejun Yang, Y. Zhang, T. Tang, Xudong Fang


Citation count: 5

Abstract

Graphics processing units (GPUs), with their many lightweight data-parallel cores, can provide substantial parallel computational power to accelerate general-purpose applications. However, this computing capacity cannot be fully utilized by memory-intensive applications, which are limited by off-chip memory bandwidth and latency. Stencil computation has abundant parallelism and low computational intensity, which makes it a useful architectural evaluation benchmark. In this paper, we propose several memory optimizations for mgrid, a stencil-based application from the SPEC 2K benchmarks. By exploiting data locality in a three-level memory hierarchy and tuning the thread granularity, we reduce the pressure on off-chip memory bandwidth. To hide the long off-chip memory access latency, we further prefetch data during computation through double buffering. To fully exploit the CPU-GPU heterogeneous system, we redistribute the computation between these two computing resources. With all these optimizations, we achieve a 24.2x speedup over the simple mapping version and as much as a 34.3x speedup over a CPU implementation.
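The locality idea in the abstract can be illustrated with a small sketch. This is not the paper's kernel: the 7-point stencil, the weights `c0` and `c1`, the tile size, and every function name below are assumptions chosen for illustration, and plain Python lists stand in for GPU memory. The tiled version copies each tile plus a one-cell halo into a small local buffer (a stand-in for on-chip memory) and computes from that buffer, so each grid value is fetched from the large array far fewer times. Double buffering and the CPU-GPU work split are not shown.

```python
# Illustrative sketch only: a 7-point 3D stencil computed naively and then
# tile by tile from a local buffer. All names and constants are assumptions.

def idx(i, j, k, n):
    """Flatten (i, j, k) into a row-major index for an n*n*n grid."""
    return (i * n + j) * n + k


def stencil_naive(src, n, c0=0.5, c1=1.0 / 12.0):
    """Update every interior point, reading all 7 inputs from src directly."""
    dst = list(src)  # boundary points are copied unchanged
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            for k in range(1, n - 1):
                neighbours = (
                    src[idx(i - 1, j, k, n)] + src[idx(i + 1, j, k, n)]
                    + src[idx(i, j - 1, k, n)] + src[idx(i, j + 1, k, n)]
                    + src[idx(i, j, k - 1, n)] + src[idx(i, j, k + 1, n)]
                )
                dst[idx(i, j, k, n)] = c0 * src[idx(i, j, k, n)] + c1 * neighbours
    return dst


def stencil_tiled(src, n, tile=4, c0=0.5, c1=1.0 / 12.0):
    """Same update, but each tile is first staged (with halo) in a local buffer."""
    dst = list(src)
    for i0 in range(1, n - 1, tile):
        for j0 in range(1, n - 1, tile):
            for k0 in range(1, n - 1, tile):
                # Clip the tile at the edge of the interior region.
                ti = min(tile, n - 1 - i0)
                tj = min(tile, n - 1 - j0)
                tk = min(tile, n - 1 - k0)
                wj, wk = tj + 2, tk + 2
                # One pass over the large array: tile plus one-cell halo.
                local = [
                    src[idx(i0 - 1 + a, j0 - 1 + b, k0 - 1 + c, n)]
                    for a in range(ti + 2)
                    for b in range(tj + 2)
                    for c in range(tk + 2)
                ]

                def at(a, b, c):
                    return local[(a * wj + b) * wk + c]

                # All stencil reads now hit the local buffer.
                for a in range(1, ti + 1):
                    for b in range(1, tj + 1):
                        for c in range(1, tk + 1):
                            neighbours = (
                                at(a - 1, b, c) + at(a + 1, b, c)
                                + at(a, b - 1, c) + at(a, b + 1, c)
                                + at(a, b, c - 1) + at(a, b, c + 1)
                            )
                            out = idx(i0 - 1 + a, j0 - 1 + b, k0 - 1 + c, n)
                            dst[out] = c0 * at(a, b, c) + c1 * neighbours
    return dst


def reads_per_point_tiled(tile):
    """Large-array reads per updated point for a full cubic tile with halo."""
    return (tile + 2) ** 3 / tile ** 3


if __name__ == "__main__":
    n = 10
    grid = [((i * 7 + 3) % 11) / 11.0 for i in range(n ** 3)]
    ref = stencil_naive(grid, n)
    out = stencil_tiled(grid, n, tile=4)
    print(max(abs(x - y) for x, y in zip(ref, out)))
```

In this simplified counting model the naive loop performs 7 reads of the large array per updated point, while a full 4x4x4 tile with its halo needs 216 reads for 64 points, about 3.4 per point. Real GPUs add caches, coalescing rules, and occupancy limits that this model ignores, so the ratio is only a rough indication of why staging data on chip eases bandwidth pressure.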
