Is GPU enthusiasm vanishing?
C. Trinitis
2012 International Conference on High Performance Computing & Simulation (HPCS), 2012-07-02
DOI: 10.1109/HPCSim.2012.6266945

Abstract. In recent years, there has been considerable hype around porting compute-intensive kernels to GPUs, with claimed speedups sometimes exceeding 100×. However, for a number of compute-intensive applications investigated at TUM, the outcome looks somewhat different. In addition, the overhead of porting applications to GPUs, or more generally to accelerators, needs to be taken into account. Since accelerators can yield both very promising and very disappointing results, depending on the application, the community is, as usual, divided into GPU enthusiasts on one side and GPU opponents on the other.

In both industrial and academic practice, the question arises of what to do with existing compute-intensive applications (often numerical simulation codes) that have been in use for years or even decades and are treated as "never change a running system" code. These fall broadly into three categories:

- code that should not be touched, as it will most likely no longer run if anything is modified (a complete rewrite is required if it is to run efficiently);
- code whose compute-intensive parts can be rewritten (a partial rewrite is required); and
- code that can easily be ported to new programming paradigms (straightforward adaptation is possible).

Given that CPUs integrate more and more features known from accelerators, one might conclude that most codes fall into the third category, since the required porting effort appears to be shrinking and compilers are constantly improving. However, although features like automatic parallelization can be handled by compilers, tuning by hand or using hardware-specific programming paradigms still outperforms generic approaches.

GPU enthusiasts are mainly keen on CUDA (with some of them moving to OpenCL), while GPU opponents claim that with aggressive optimization of compute-intensive numerical code, CPUs can match or even beat accelerators, effectively treating the vector units operating on AVX registers as on-chip accelerators. It therefore remains unclear which programming interface will satisfy both CPU and accelerator programmers and eventually become a de facto standard.

Besides GPUs from NVIDIA and AMD, another interesting approach in the accelerator world is Intel's MIC architecture, around which a number of supercomputing projects are already being built. Because MIC is based on the x86 ISA and comes with the full tool chain, from compilers to debuggers to performance analysis tools, it aims to minimize the porting effort to accelerators from the programmer's point of view. The talk presents examples from high-performance computing that fall into the three categories above, and shows how these codes have been adapted to modern processor and accelerator architectures.