Program Optimization of Stencil Based Application on the GPU-Accelerated System

Guibin Wang, Xuejun Yang, Y. Zhang, T. Tang, Xudong Fang
2009 IEEE International Symposium on Parallel and Distributed Processing with Applications. Published 2009-08-18. DOI: 10.1109/ISPA.2009.70 (https://doi.org/10.1109/ISPA.2009.70)
Citations: 5

Abstract

Graphics processing units (GPUs), with many lightweight data-parallel cores, can provide substantial parallel computational power to accelerate general-purpose applications. However, this computing capacity cannot be fully utilized by memory-intensive applications, which are limited by off-chip memory bandwidth and latency. Stencil computation has abundant parallelism and low computational intensity, which makes it a useful architectural evaluation benchmark. In this paper, we propose several memory optimizations for mgrid, a stencil-based application from the SPEC 2K benchmarks. By exploiting data locality in the 3-level memory hierarchy and tuning the thread granularity, we reduce the pressure on off-chip memory bandwidth. To hide the long off-chip memory access latency, we further prefetch data during computation using double buffering. To fully exploit the CPU-GPU heterogeneous system, we redistribute the computation between these two computing resources. With all these optimizations, we gain a 24.2x speedup compared to the simple mapping version, and as high as a 34.3x speedup compared with a CPU implementation.
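The data-locality idea mentioned in the abstract can be illustrated with a minimal CPU-side sketch. It is not the paper's implementation: it assumes a simplified 2D 5-point averaging stencil rather than mgrid's actual 3D kernels, and the function names `stencil_direct` and `stencil_tiled` are hypothetical. The `local` buffer only stands in for fast on-chip memory: each tile and its one-cell halo are staged once, and all neighbor reads for that tile are then served from the staged copy instead of the full grid.

```python
def stencil_direct(grid):
    """Reference version: every neighbor read goes to the full grid."""
    n, m = len(grid), len(grid[0])
    out = [row[:] for row in grid]  # boundary cells are copied unchanged
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            out[i][j] = (grid[i][j] + grid[i - 1][j] + grid[i + 1][j]
                         + grid[i][j - 1] + grid[i][j + 1]) / 5.0
    return out


def stencil_tiled(grid, tile=4):
    """Tiled version: stage each tile plus its halo into a local buffer.

    The local buffer is an assumption of this sketch; on a GPU it would
    correspond to on-chip storage shared by the threads of one block.
    """
    n, m = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for i0 in range(1, n - 1, tile):
        for j0 in range(1, m - 1, tile):
            i1 = min(i0 + tile, n - 1)  # exclusive end of tile rows
            j1 = min(j0 + tile, m - 1)  # exclusive end of tile columns
            # One pass over the full grid per tile: tile plus 1-cell halo.
            local = [grid[i][j0 - 1:j1 + 1] for i in range(i0 - 1, i1 + 1)]
            for i in range(i0, i1):
                for j in range(j0, j1):
                    li, lj = i - i0 + 1, j - j0 + 1  # indices inside local
                    out[i][j] = (local[li][lj] + local[li - 1][lj]
                                 + local[li + 1][lj] + local[li][lj - 1]
                                 + local[li][lj + 1]) / 5.0
    return out
```

Both functions perform the same additions in the same order, so they produce identical results; the tiled form only changes where the operands are read from. With a tile of side T, each tile stages (T + 2)^2 values to compute T^2 outputs, so larger tiles lower the number of staged values per output, which is the kind of trade-off the abstract's thread-granularity tuning refers to.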