Exposing Hidden Performance Opportunities in High Performance GPU Applications

2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID). Pub Date: 2018-05-01. DOI: 10.1109/CCGRID.2018.00045

Benjamin Welton, B. Miller


Citations: 9

Abstract

Leadership-class systems whose nodes contain many-core accelerators, such as GPUs, have the potential to increase application performance. Effectively exploiting the parallelism these accelerators provide requires developers to identify where accelerator parallelization would provide benefit and to ensure efficient interaction between the CPU and the accelerator. In the abstract, these issues appear straightforward and well understood. However, we have found that significant untapped performance opportunities exist in these areas, even in well-known, heavily optimized, real-world applications created by experienced GPU developers. These opportunities exist for three reasons: accelerated libraries can create unexpected synchronization delays and memory transfer requests, accelerated libraries can interact inefficiently when combined, and vectorization opportunities can be hidden by the structure of the program. In the applications we studied (Qball, QBox, Hoomd-blue, LAMMPs, and cuIBM), exploiting these opportunities reduced execution time by 18% to 87%. In this work, we provide concrete evidence of the existence of these performance issues and of their impact on real-world applications today. We characterize the missed performance opportunities we identified by their underlying cause and describe a preliminary design of detection methods that performance tools can use to identify them.
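To make the cost of an unexpected synchronization concrete, the following is a toy timeline model, not the detection method described in the paper. It assumes one host thread and one in-order GPU queue; the function name `total_time`, the step kinds, and all durations are hypothetical and chosen only for illustration. A "sync" step stands in for any library call that blocks the host until queued GPU work finishes.

```python
def total_time(steps):
    """Simulate one host thread and one in-order GPU queue.

    steps is a list of (kind, duration) tuples, where kind is:
      "cpu"    - host work that takes `duration` time units
      "launch" - asynchronous GPU work of `duration`; the host does not wait
      "sync"   - host blocks until all queued GPU work is done (duration ignored)
    Returns the time at which both the host and the GPU are finished.
    """
    cpu_t = 0.0  # time at which the host becomes free
    gpu_t = 0.0  # time at which the GPU queue drains
    for kind, duration in steps:
        if kind == "cpu":
            cpu_t += duration
        elif kind == "launch":
            # GPU work starts once it is issued and the queue is free.
            gpu_t = max(gpu_t, cpu_t) + duration
        elif kind == "sync":
            # The host cannot proceed until the GPU has caught up.
            cpu_t = max(cpu_t, gpu_t)
        else:
            raise ValueError(f"unknown step kind: {kind!r}")
    return max(cpu_t, gpu_t)


# Hypothetical workload: two GPU kernels (4 units each) and two pieces of
# independent host work (3 units each).
hidden_sync = [
    ("launch", 4), ("sync", 0), ("cpu", 3),
    ("launch", 4), ("sync", 0), ("cpu", 3),
    ("sync", 0),
]
overlapped = [
    ("launch", 4), ("cpu", 3),
    ("launch", 4), ("cpu", 3),
    ("sync", 0),
]

print(total_time(hidden_sync))  # 14.0: host work never overlaps GPU work
print(total_time(overlapped))   # 8.0: host work hides behind GPU work
```

In this model the only difference between the two schedules is the pair of intermediate "sync" steps, yet they serialize the host and GPU timelines. The specific numbers carry no meaning beyond the example and are unrelated to the reductions reported in the abstract.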

