Is GPU enthusiasm vanishing?
C. Trinitis
2012 International Conference on High Performance Computing & Simulation (HPCS), 2012-07-02
DOI: 10.1109/HPCSim.2012.6266945

Abstract. In recent years, there has been considerable hype around porting compute-intensive kernels to GPUs, with claimed speedups sometimes exceeding 100×. However, for a number of compute-intensive applications investigated at TUM, the outcome looks somewhat different. In addition, the overhead of porting applications to GPUs, or more generally to accelerators, needs to be taken into account. Since accelerators can yield both very promising and very disappointing results, depending on the application, the community is, as usual, divided into GPU enthusiasts on one side and GPU opponents on the other.

In both industrial and academic practice, the question arises of what to do with existing compute-intensive applications (often numerical simulation codes) that have been in use for years or even decades and are treated as "never change a running system" code. These fall broadly into three categories:

- code that should not be touched, as it will most likely no longer run if anything is modified (a complete rewrite is required if it is to run efficiently);
- code whose compute-intensive parts can be rewritten (a partial rewrite is required); and
- code that can easily be ported to new programming paradigms (straightforward adaptation is possible).

Given that CPUs integrate more and more features known from accelerators, one might conclude that most codes fall into the third category, since the required porting effort appears to be shrinking and compilers are constantly improving. However, although features like automatic parallelization can be handled by compilers, tuning by hand or using hardware-specific programming paradigms still outperforms generic approaches.

GPU enthusiasts are mainly keen on CUDA (with some of them moving to OpenCL), while GPU opponents claim that with aggressive optimization of compute-intensive numerical code, CPUs can match or even beat accelerators, effectively treating the vector units operating on AVX registers as on-chip accelerators. It therefore remains unclear which programming interface will satisfy both CPU and accelerator programmers and eventually become a de facto standard.

Besides GPUs from NVIDIA and AMD, another interesting approach in the accelerator world is Intel's MIC architecture, around which a number of supercomputing projects are already being built. Because MIC is based on the x86 ISA and comes with the full tool chain, from compilers to debuggers to performance analysis tools, it aims to minimize the porting effort to accelerators from the programmer's point of view. The talk presents examples from high-performance computing that fall into the three categories above, and shows how these codes have been adapted to modern processor and accelerator architectures.