Yueming Wei, Yichao Wang, Linjin Cai, W. Tang, Bei Wang, S. Ethier, S. See, James Lin
{"title":"使用OpenACC加速版GTC-P进行性能和可移植性研究","authors":"Yueming Wei, Yichao Wang, Linjin Cai, W. Tang, Bei Wang, S. Ethier, S. See, James Lin","doi":"10.1109/PDCAT.2016.019","DOIUrl":null,"url":null,"abstract":"Accelerator-based heterogeneous computing is of paramount importance to High Performance Computing. The increasing complexity of the cluster architectures requires more generic, high-level programming models. OpenACC is a directive-based parallel programming model, which provides performance on and portability across a wide variety of platforms, including GPU, multicore CPU, and many-core processors. GTC-P is a discovery-science-capable real-world application code based on the Particle-In-Cell (PIC) algorithm that is well-established in the HPC area. Several native versions of GTC-P have been developed for supercomputers on TOP500 with different architectures, including Titan, Mira, etc. Motivated by the state-of-art portability, we implemented the first OpenACC version of GTC-P and evaluated its performance portability across NVIDIA GPUs, Intel x86 and OpenPOWER CPUs. In this paper, we also proposed two key optimization methods for OpenACC implementation of PIC algorithm on multicore CPU and GPU including removing atomic operation and taking advantage of shared memory. OpenACC shows both impressive productivity and performance in a perspective of portability and scalability. The OpenACC version achieves more than 90% performance compared with the native versions with only about 300 LOC.","PeriodicalId":203925,"journal":{"name":"2016 17th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":"{\"title\":\"Performance and Portability Studies with OpenACC Accelerated Version of GTC-P\",\"authors\":\"Yueming Wei, Yichao Wang, Linjin Cai, W. Tang, Bei Wang, S. Ethier, S. See, James Lin\",\"doi\":\"10.1109/PDCAT.2016.019\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Accelerator-based heterogeneous computing is of paramount importance to High Performance Computing. The increasing complexity of the cluster architectures requires more generic, high-level programming models. OpenACC is a directive-based parallel programming model, which provides performance on and portability across a wide variety of platforms, including GPU, multicore CPU, and many-core processors. GTC-P is a discovery-science-capable real-world application code based on the Particle-In-Cell (PIC) algorithm that is well-established in the HPC area. Several native versions of GTC-P have been developed for supercomputers on TOP500 with different architectures, including Titan, Mira, etc. Motivated by the state-of-art portability, we implemented the first OpenACC version of GTC-P and evaluated its performance portability across NVIDIA GPUs, Intel x86 and OpenPOWER CPUs. In this paper, we also proposed two key optimization methods for OpenACC implementation of PIC algorithm on multicore CPU and GPU including removing atomic operation and taking advantage of shared memory. OpenACC shows both impressive productivity and performance in a perspective of portability and scalability. The OpenACC version achieves more than 90% performance compared with the native versions with only about 300 LOC.\",\"PeriodicalId\":203925,\"journal\":{\"name\":\"2016 17th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT)\",\"volume\":\"43 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2016-12-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"5\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2016 17th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/PDCAT.2016.019\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 17th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/PDCAT.2016.019","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Performance and Portability Studies with OpenACC Accelerated Version of GTC-P
Accelerator-based heterogeneous computing is of paramount importance to High Performance Computing. The increasing complexity of the cluster architectures requires more generic, high-level programming models. OpenACC is a directive-based parallel programming model, which provides performance on and portability across a wide variety of platforms, including GPU, multicore CPU, and many-core processors. GTC-P is a discovery-science-capable real-world application code based on the Particle-In-Cell (PIC) algorithm that is well-established in the HPC area. Several native versions of GTC-P have been developed for supercomputers on TOP500 with different architectures, including Titan, Mira, etc. Motivated by the state-of-art portability, we implemented the first OpenACC version of GTC-P and evaluated its performance portability across NVIDIA GPUs, Intel x86 and OpenPOWER CPUs. In this paper, we also proposed two key optimization methods for OpenACC implementation of PIC algorithm on multicore CPU and GPU including removing atomic operation and taking advantage of shared memory. OpenACC shows both impressive productivity and performance in a perspective of portability and scalability. The OpenACC version achieves more than 90% performance compared with the native versions with only about 300 LOC.