Exposing Hidden Performance Opportunities in High Performance GPU Applications

Benjamin Welton, B. Miller
{"title":"Exposing Hidden Performance Opportunities in High Performance GPU Applications","authors":"Benjamin Welton, B. Miller","doi":"10.1109/CCGRID.2018.00045","DOIUrl":null,"url":null,"abstract":"Leadership class systems with nodes containing many-core accelerators, such as GPUs, have the potential to increase the performance of applications. Effectively exploiting the parallelism provided by many-core accelerators requires developers to identify where accelerator parallelization would provide benefit and ensuring efficient interaction between the CPU and accelerator. In the abstract, these issues appear straightforward and well understood. However, we have found that significant untapped performance opportunities exist in these areas even in well-known, heavily optimized, real world applications created by experienced GPU developers. These untapped performance opportunities exist because accelerated libraries can create unexpected synchronization delay and memory transfer requests, interaction between accelerated libraries can cause unexpected inefficiencies when combined, and vectorization opportunities can be hidden by the structure of the program. In applications we have studied (Qball, QBox, Hoomd-blue, LAMMPs, and cuIBM), exploiting these opportunities resulted in reduction of their execution time by 18%-87%. In this work, we provide concrete evidence of the existence and impact that these performance issues have on real world applications today. We characterize the missed performance opportunities we have identified by their underlying cause and describe a preliminary design of detection methods that can be used by performance tools to identify these missed opportunities.","PeriodicalId":321027,"journal":{"name":"2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","volume":"494 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"9","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CCGRID.2018.00045","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 9

Abstract

Leadership class systems with nodes containing many-core accelerators, such as GPUs, have the potential to increase the performance of applications. Effectively exploiting the parallelism provided by many-core accelerators requires developers to identify where accelerator parallelization would provide benefit and ensuring efficient interaction between the CPU and accelerator. In the abstract, these issues appear straightforward and well understood. However, we have found that significant untapped performance opportunities exist in these areas even in well-known, heavily optimized, real world applications created by experienced GPU developers. These untapped performance opportunities exist because accelerated libraries can create unexpected synchronization delay and memory transfer requests, interaction between accelerated libraries can cause unexpected inefficiencies when combined, and vectorization opportunities can be hidden by the structure of the program. In applications we have studied (Qball, QBox, Hoomd-blue, LAMMPs, and cuIBM), exploiting these opportunities resulted in reduction of their execution time by 18%-87%. In this work, we provide concrete evidence of the existence and impact that these performance issues have on real world applications today. We characterize the missed performance opportunities we have identified by their underlying cause and describe a preliminary design of detection methods that can be used by performance tools to identify these missed opportunities.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
揭示高性能GPU应用程序中隐藏的性能机会
具有包含多核加速器(如gpu)的节点的领导类系统具有提高应用程序性能的潜力。有效地利用多核加速器提供的并行性要求开发人员确定加速器并行化将在何处提供好处,并确保CPU和加速器之间的有效交互。抽象地说,这些问题似乎很简单,也很容易理解。然而,我们发现在这些领域中存在着大量未开发的性能机会,即使是在由经验丰富的GPU开发人员创建的知名的、经过大量优化的现实世界应用程序中也是如此。这些未开发的性能机会之所以存在,是因为加速库可能会产生意想不到的同步延迟和内存传输请求,加速库之间的交互在组合时可能会导致意想不到的低效率,并且向量化机会可能被程序的结构所隐藏。在我们研究的应用程序(Qball、QBox、hood -blue、lamp和cuIBM)中,利用这些机会可以将它们的执行时间减少18%-87%。在这项工作中,我们提供了具体的证据,证明这些性能问题的存在及其对当今现实世界应用程序的影响。我们通过其潜在原因对错失的性能机会进行了描述,并描述了性能工具可以使用的检测方法的初步设计,以识别这些错失的机会。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Extreme-Scale Realistic Stencil Computations on Sunway TaihuLight with Ten Million Cores RideMatcher: Peer-to-Peer Matching of Passengers for Efficient Ridesharing Nitro: Network-Aware Virtual Machine Image Management in Geo-Distributed Clouds Improving Energy Efficiency of Database Clusters Through Prefetching and Caching Main-Memory Requirements of Big Data Applications on Commodity Server Platform
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1