Investigating the performance and productivity of DASH using the Cowichan problems
K. Fürlinger, R. Kowalewski, Tobias Fuchs, Benedikt Lehmann
DASH is a new realization of the PGAS (Partitioned Global Address Space) programming model in the form of a C++ template library. Instead of using a custom compiler, DASH provides expressive programming constructs using C++ abstraction mechanisms and offers distributed data structures and parallel algorithms that follow the concepts employed by the C++ standard template library (STL). In this paper we evaluate the performance and productivity of DASH by comparing our implementation of a set of benchmark programs with those developed by expert programmers in Intel Cilk, Intel TBB (Threading Building Blocks), Go and Chapel. We perform a comparison on shared memory multiprocessor systems ranging from moderately parallel multicore systems to a 64-core manycore system. We additionally perform a scalability study on a distributed memory system on up to 20 nodes (800 cores). Our results demonstrate that DASH offers productivity that is comparable with the best established programming systems for shared memory and also achieves comparable or better performance. Our results on multi-node systems show that DASH scales well and achieves excellent performance.
DOI: 10.1145/3176364.3176366
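To give a flavor of the STL-like programming style described above, the following is a minimal sketch of a DASH program. It assumes the dash::Array container and the dash::init, dash::finalize, dash::barrier, dash::myid and dash::min_element entry points of libdash; exact headers and signatures may differ between library versions, so treat this as an illustration of the style rather than verified DASH code.

```cpp
#include <libdash.h>
#include <iostream>

int main(int argc, char* argv[]) {
  dash::init(&argc, &argv);          // set up the PGAS runtime (DART)

  // A distributed array of 1000 integers, block-partitioned over all units.
  dash::Array<int> arr(1000);

  // Each unit initializes only its local portion through local pointers.
  int offset = 0;
  for (auto it = arr.lbegin(); it != arr.lend(); ++it) {
    *it = offset++;
  }
  dash::barrier();

  // STL-style algorithm over the whole distributed range.
  auto min_it = dash::min_element(arr.begin(), arr.end());
  if (dash::myid() == 0) {
    int min_value = *min_it;         // dereferencing fetches the (possibly remote) value
    std::cout << "global minimum: " << min_value << std::endl;
  }

  dash::finalize();
  return 0;
}
```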
Optimizing a particle-in-cell code on Intel knights landing
Minhua Wen, Min Chen, James Lin
The particle-in-cell (PIC) method is one of the mainstream algorithms in laser plasma research. However, achieving high performance with PIC codes on the Intel Knights Landing (KNL) processor remains a programming challenge of wide concern to laser plasma researchers. We took VLPL-S, the PIC code developed at Shanghai Jiao Tong University, as an example to address this concern. We applied three types of optimization: compute-oriented optimizations, parallel I/O, and dynamic load balancing. We evaluated the optimized VLPL-S code with real test cases on the KNL. The experimental results show that our optimizations achieve a 1.53x speedup in overall performance, and that the optimized code runs 1.77x faster on the KNL than on a two-socket Intel Xeon E5-2697 v4 node. The optimizations we developed for the VLPL-S code can be applied to other PIC codes.
DOI: 10.1145/3176364.3176376
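The abstract does not show the VLPL-S kernels, so the sketch below is only a hedged illustration of what a "compute-oriented optimization" of a PIC inner loop typically involves on KNL: a structure-of-arrays particle layout and a combined OpenMP threading/vectorization directive. The Particles type and push function are hypothetical and are not taken from VLPL-S.

```cpp
#include <vector>
#include <cstddef>

// Structure-of-arrays particle storage: contiguous per-component arrays
// vectorize far better on KNL's 512-bit SIMD units than an array of structs.
struct Particles {
  std::vector<double> x, y, z;     // positions
  std::vector<double> vx, vy, vz;  // velocities
};

// Simplified field-free particle push (drift only), illustrating the kind of
// loop that compute-oriented optimizations target: unit-stride access and a
// single OpenMP "parallel for simd" that both threads and vectorizes the loop.
void push(Particles& p, double dt) {
  const std::size_t n = p.x.size();
#pragma omp parallel for simd
  for (std::size_t i = 0; i < n; ++i) {
    p.x[i] += p.vx[i] * dt;
    p.y[i] += p.vy[i] * dt;
    p.z[i] += p.vz[i] * dt;
  }
}
```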
Recent experiences in using MPI-3 RMA in the DASH PGAS runtime
Joseph Schuchart, R. Kowalewski, Karl Fuerlinger
The Partitioned Global Address Space (PGAS) programming model has become a viable alternative to traditional message passing using MPI. The DASH project provides a PGAS abstraction entirely based on C++11. The underlying DASH RunTime, DART, provides communication and management functionality transparently to the user. In order to facilitate incremental transitions of existing MPI-parallel codes, the development of DART has focused on creating a PGAS runtime based on the MPI-3 RMA standard. From an MPI-RMA user perspective, this paper outlines our recent experiences in the development of DART and presents insights into issues that we faced and how we attempted to solve them, including issues surrounding memory allocation and memory consistency as well as communication latencies. We implemented a set of benchmarks for global memory allocation latency in the framework of the OSU micro-benchmark suite and present results for allocation and communication latency measurements of different global memory allocation strategies under three different MPI implementations.
DOI: 10.1145/3176364.3176367
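As background for the allocation and communication latencies discussed above, the following standalone sketch shows the MPI-3 RMA building blocks involved: collective window allocation with MPI_Win_allocate and passive-target communication with MPI_Put plus MPI_Win_flush. It is a minimal illustration, not DART code, and it assumes the unified MPI memory model when reading the local window after synchronization.

```cpp
#include <mpi.h>
#include <cstdio>

// One global "segment" per process, allocated collectively with
// MPI_Win_allocate (usually the fastest allocation path, since the MPI
// library can back it with registered memory), then accessed with
// passive-target RMA: lock_all once, MPI_Put + flush per transfer.
int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  const MPI_Aint nelem = 1024;
  double* base = nullptr;
  MPI_Win win;
  MPI_Win_allocate(nelem * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &base, &win);

  MPI_Win_lock_all(0, win);            // passive-target epoch covering all ranks

  // Each rank writes its rank id into slot 0 of its right neighbour's window.
  double value = static_cast<double>(rank);
  int target = (rank + 1) % size;
  MPI_Put(&value, 1, MPI_DOUBLE, target, 0, 1, MPI_DOUBLE, win);
  MPI_Win_flush(target, win);          // complete the put at the target

  MPI_Win_unlock_all(win);
  MPI_Barrier(MPI_COMM_WORLD);         // neighbour's put is now visible locally
  std::printf("rank %d received %.0f\n", rank, base[0]);

  MPI_Win_free(&win);
  MPI_Finalize();
  return 0;
}
```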
Linkage of XcalableMP and Python languages for high productivity on HPC cluster system: application to graph order/degree problem
M. Nakao, H. Murai, T. Boku, M. Sato
When developing applications on high-performance computing (HPC) cluster systems, Partitioned Global Address Space (PGAS) languages are used because of their high productivity and performance. However, to develop such applications more efficiently, it is also important to be able to combine a PGAS language with other languages instead of using a single PGAS language alone. We have designed the XcalableMP (XMP) PGAS language and developed Omni Compiler as an XMP compiler. In this paper, we report on the development of linkage functions between XMP and C, Fortran, or Python for Omni Compiler. Furthermore, as a functional example of interworking between XMP and Python, we discuss the development of an application for the Graph Order/degree problem. Specifically, we parallelized the application's searches for the shortest paths among all vertices using XMP. When we compared the XMP version of the application with the original Python version, the XMP version was 21% faster on a single CPU core. Moreover, when running the application on an HPC cluster system with 1,280 CPU cores across 64 compute nodes, we achieved a 921-fold performance improvement over a single CPU core.
DOI: 10.1145/3176364.3176369
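The parallelization described above exploits the fact that the shortest-path searches from different source vertices are independent. The sketch below shows that pattern in generic C++ with OpenMP rather than in XMP: each worker runs breadth-first searches from its share of source vertices, and the per-source path-length sums are reduced. The function name and graph representation are illustrative assumptions, not the paper's code.

```cpp
#include <vector>
#include <queue>
#include <cstdint>

// Sum of shortest-path lengths over all ordered (source, target) pairs of an
// unweighted graph. The per-source BFS searches are independent, so they can
// be distributed across workers (XMP nodes in the paper, OpenMP threads here)
// and the partial sums combined with a reduction.
std::int64_t total_shortest_paths(const std::vector<std::vector<int>>& adj) {
  const int n = static_cast<int>(adj.size());
  std::int64_t total = 0;

#pragma omp parallel for reduction(+ : total) schedule(dynamic)
  for (int src = 0; src < n; ++src) {
    std::vector<int> dist(n, -1);    // BFS distances from this source
    std::queue<int> q;
    dist[src] = 0;
    q.push(src);
    while (!q.empty()) {
      int u = q.front();
      q.pop();
      for (int v : adj[u]) {
        if (dist[v] < 0) {
          dist[v] = dist[u] + 1;
          q.push(v);
        }
      }
    }
    for (int d : dist) {
      if (d > 0) total += d;         // accumulate reachable distances
    }
  }
  return total;
}
```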
OpenMP-based parallel implementation of matrix-matrix multiplication on the intel knights landing
Roktaek Lim, Yeongha Lee, Raehyun Kim, Jaeyoung Choi
The second-generation Intel Xeon Phi processor, codenamed Knights Landing (KNL), has emerged with a 2D tile mesh architecture. Implementing general matrix-matrix multiplication on a new architecture is an important exercise. To date, there has not been a sufficient description of a parallel implementation of general matrix-matrix multiplication. In this study, we describe a parallel implementation of double-precision general matrix-matrix multiplication (DGEMM) with OpenMP on the KNL. The implementation is based on blocked matrix-matrix multiplication. We propose a method for choosing the cache block sizes and discuss the parallelism within the implementation of DGEMM. We show that the performance of DGEMM varies with the thread affinity environment variables. We conducted performance experiments with the Intel Xeon Phi 7210 and 7250; these experiments validate our method.
DOI: 10.1145/3176364.3176374
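As a reference point for the blocked approach described above, here is a minimal cache-blocked DGEMM sketch with OpenMP. The block sizes MC, KC and NC are illustrative placeholders rather than the cache-derived values the paper's method selects, and the tuned micro-kernel and thread-affinity settings the paper studies (e.g. KMP_AFFINITY or OMP_PROC_BIND/OMP_PLACES) are not reproduced here.

```cpp
#include <vector>
#include <algorithm>
#include <cstddef>

// Minimal cache-blocked DGEMM sketch: C += A * B with A (m x k), B (k x n),
// C (m x n), all row-major. MC/KC/NC stand in for cache-capacity-derived
// block sizes; they are illustrative, not tuned KNL parameters.
constexpr std::size_t MC = 64, KC = 256, NC = 512;

void dgemm_blocked(const double* A, const double* B, double* C,
                   std::size_t m, std::size_t n, std::size_t k) {
  // Each thread owns disjoint (ic, jc) tiles of C, so no race on the updates.
#pragma omp parallel for collapse(2)
  for (std::size_t ic = 0; ic < m; ic += MC)
    for (std::size_t jc = 0; jc < n; jc += NC)
      for (std::size_t pc = 0; pc < k; pc += KC) {
        const std::size_t mb = std::min(MC, m - ic);
        const std::size_t nb = std::min(NC, n - jc);
        const std::size_t kb = std::min(KC, k - pc);
        // Inner kernel: the contiguous loop over j vectorizes well.
        for (std::size_t i = 0; i < mb; ++i)
          for (std::size_t p = 0; p < kb; ++p) {
            const double a = A[(ic + i) * k + (pc + p)];
            for (std::size_t j = 0; j < nb; ++j)
              C[(ic + i) * n + (jc + j)] += a * B[(pc + p) * n + (jc + j)];
          }
      }
}
```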
Performance evaluation for omni XcalableMP compiler on many-core cluster system based on knights landing
M. Nakao, H. Murai, T. Boku, M. Sato
To reduce the programming cost on cluster systems, Partitioned Global Address Space (PGAS) languages are used. We have designed the XcalableMP (XMP) PGAS language and developed the Omni XMP compiler (Omni compiler) for XMP. In the present study, we evaluated the performance of the Omni compiler on Oakforest-PACS, a cluster system based on Knights Landing, and on a general Linux cluster system. We performed performance tuning of the Omni compiler using a Lattice QCD mini-application and some mathematical functions appearing in that application. As a result, the tuned Omni compiler outperformed the untuned version on both systems. Furthermore, we compared the performance of XMP with the tuned Omni compiler to that of the existing MPI+OpenMP programming model. The results showed that the Lattice QCD mini-application written in XMP achieved more than 94% of the performance of the MPI+OpenMP implementation.
DOI: 10.1145/3176364.3176372
Optimizing two-electron repulsion integral calculation on knights landing architecture
Yingqi Tian, B. Suo, Yingjin Ma, Zhong Jin
In this paper, we introduce the optimization methods we used to accelerate two-electron repulsion integral calculation on the Knights Landing (KNL) architecture. We developed a schedule for parallelism and vectorization, and we compared two different methods for calculating the lower incomplete gamma function. Our optimizations achieved a 1.7x speedup on the KNL over a CPU platform.
DOI: 10.1145/3176364.3176371
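The abstract does not say which two formulations of the lower incomplete gamma function were compared. As background, the sketch below implements one standard method, the power series gamma(a, x) = x^a * e^(-x) * sum_{n>=0} x^n / (a(a+1)...(a+n)), which converges quickly for x < a + 1; a continued-fraction evaluation of the upper incomplete gamma is the usual alternative for larger x. The function name and convergence thresholds are illustrative.

```cpp
#include <cmath>

// Lower incomplete gamma function gamma(a, x) = integral_0^x t^(a-1) e^(-t) dt
// evaluated with the power series
//   gamma(a, x) = x^a e^(-x) * sum_{n>=0} x^n / (a (a+1) ... (a+n)),
// which converges quickly for x < a + 1.
double lower_incomplete_gamma(double a, double x) {
  if (x <= 0.0) return 0.0;
  double term = 1.0 / a;             // n = 0 term of the sum
  double sum = term;
  for (int n = 1; n < 200; ++n) {
    term *= x / (a + n);             // ratio of consecutive terms
    sum += term;
    if (term < sum * 1e-15) break;   // converged to double precision
  }
  return std::exp(a * std::log(x) - x) * sum;
}
```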
Scaling collectives on large clusters using Intel(R) architecture processors and fabric
Masashi Horikoshi, L. Meadows, Tom Elken, P. Sivakumar, E. Mascarenhas, James Erwin, D. Durnov, Alexander Sannikov, T. Hanawa, T. Boku
This paper provides results on scaling Barrier and Allreduce to 8192 nodes on a cluster of Intel® Xeon Phi™ processors installed at the University of Tokyo and the University of Tsukuba. We describe the effects of OS and platform noise on the performance of these collectives, and provide ways to minimize the noise as well as isolate it to specific cores. We provide results showing that Barrier and Allreduce scale well when noise is reduced. We were able to achieve a latency of 94 usec (a 7.1x speedup over the baseline) for a 1-rank-per-node Barrier and 145 usec (a 3.3x speedup) for Allreduce with a 16-byte (16B) message size at 4096 nodes.
DOI: 10.1145/3176364.3176373
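The latencies quoted above are the kind of numbers produced by an OSU-style micro-benchmark loop; the sketch below shows the measurement pattern for Barrier and a 16-byte Allreduce (two doubles). Iteration counts are illustrative, and the noise-mitigation and core-isolation settings the paper studies are outside the scope of the sketch.

```cpp
#include <mpi.h>
#include <cstdio>

// Time many back-to-back collective calls and report the mean per-call
// latency in microseconds, first for Barrier, then for a 16-byte Allreduce.
int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  const int warmup = 100, iters = 1000;
  double in[2] = {1.0, 2.0}, out[2];  // 2 doubles = 16-byte message

  for (int i = 0; i < warmup; ++i) MPI_Barrier(MPI_COMM_WORLD);
  double t0 = MPI_Wtime();
  for (int i = 0; i < iters; ++i) MPI_Barrier(MPI_COMM_WORLD);
  double barrier_us = (MPI_Wtime() - t0) / iters * 1e6;

  for (int i = 0; i < warmup; ++i)
    MPI_Allreduce(in, out, 2, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  t0 = MPI_Wtime();
  for (int i = 0; i < iters; ++i)
    MPI_Allreduce(in, out, 2, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  double allreduce_us = (MPI_Wtime() - t0) / iters * 1e6;

  if (rank == 0)
    std::printf("Barrier: %.2f us, 16B Allreduce: %.2f us\n",
                barrier_us, allreduce_us);
  MPI_Finalize();
  return 0;
}
```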
Towards a parallel algebraic multigrid solver using PGAS
Niclas Jansson, E. Laure
The Algebraic Multigrid (AMG) method has over the years developed into an efficient tool for solving unstructured linear systems. The need to solve large industrial problems discretized on unstructured meshes has been a key motivation for devising a parallel AMG method. Despite some success, the key part of the AMG algorithm, the coarsening step, is far from trivial to parallelize efficiently. We introduce a novel parallelization of the inherently sequential Ruge-Stüben coarsening algorithm that retains most of the good interpolation properties of the original method. Our parallelization is based on the Partitioned Global Address Space (PGAS) abstraction, which greatly simplifies the parallelization compared to traditional message-passing-based implementations. The coarsening algorithm and solver are described in detail, and a performance study on a Cray XC40 is presented.
DOI: 10.1145/3176364.3176368
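For context on what the coarsening step operates on, the sketch below shows one common form of the classical Ruge-Stüben strength-of-connection test, the first phase of the coarsening, over a CSR matrix; theta = 0.25 is a conventional default. The PGAS distribution of the matrix and the parallel C/F splitting are the paper's contribution and are not reproduced here.

```cpp
#include <vector>
#include <algorithm>
#include <cstddef>

// Classical strength-of-connection test: row i strongly depends on column j
// if  -a_ij >= theta * max_{k != i} (-a_ik).  The matrix is stored in CSR
// format; the result is, for each row, the list of strongly connected columns.
struct CSRMatrix {
  std::vector<std::size_t> row_ptr;  // size n+1
  std::vector<int>         col_idx;  // column index per nonzero
  std::vector<double>      values;   // value per nonzero
};

std::vector<std::vector<int>> strong_connections(const CSRMatrix& A,
                                                 double theta = 0.25) {
  const std::size_t n = A.row_ptr.size() - 1;
  std::vector<std::vector<int>> strong(n);
  for (std::size_t i = 0; i < n; ++i) {
    double max_off = 0.0;                       // max_{k != i} (-a_ik)
    for (std::size_t p = A.row_ptr[i]; p < A.row_ptr[i + 1]; ++p)
      if (A.col_idx[p] != static_cast<int>(i))
        max_off = std::max(max_off, -A.values[p]);
    for (std::size_t p = A.row_ptr[i]; p < A.row_ptr[i + 1]; ++p)
      if (A.col_idx[p] != static_cast<int>(i) && max_off > 0.0 &&
          -A.values[p] >= theta * max_off)
        strong[i].push_back(A.col_idx[p]);
  }
  return strong;
}
```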
Performance evaluation for a hydrodynamics application in XcalableACC PGAS language for accelerated clusters
Akihiro Tabuchi, M. Nakao, H. Murai, T. Boku, M. Sato
Clusters equipped with accelerators such as GPUs and MICs are widely used. To use these clusters, programmers write programs for their applications by combining MPI with one of the accelerator programming models, such as CUDA or OpenACC. The accelerator programming component is becoming easier thanks to the directive-based OpenACC, but the complex distributed-memory programming required by MPI means that programming is still difficult. In order to simplify the programming process, XcalableACC (XACC) has been proposed as an "orthogonal" integration of the PGAS language XcalableMP (XMP) and OpenACC. XACC provides the original XMP and OpenACC features, as well as extensions for communication between accelerator memories. In this study, we implemented the hydrodynamics mini-application CloverLeaf in XACC and evaluated the usability of XACC in terms of its performance and productivity. According to the performance evaluation, the XACC version achieved 87--95% of the performance of the MPI+CUDA version and 93--101% of the MPI+OpenACC version with strong scaling, and 88--91% of the MPI+CUDA version and 94--97% of the MPI+OpenACC version with weak scaling. In particular, the halo exchange time was better with XACC than with MPI+OpenACC in some cases because the Omni XACC runtime is written in MPI and CUDA and is well tuned. The productivity evaluation showed that the application could be implemented with only small changes compared with the serial version. These results demonstrate that XACC is a practical programming language for science applications.
DOI: 10.1145/3176364.3176365
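The halo exchange mentioned above is the communication step whose cost the paper compares across the XACC, MPI+CUDA and MPI+OpenACC versions. The sketch below shows the generic pattern for a 1D block decomposition of a 2D grid using plain MPI_Sendrecv; it illustrates the pattern only and is not the Omni XACC runtime implementation, and the function name and buffer layout are assumptions.

```cpp
#include <mpi.h>
#include <vector>

// Halo (ghost-cell) exchange for a 1D block decomposition of a 2D grid:
// each rank swaps one boundary row with each neighbour. The grid holds
// (local_ny + 2) rows of nx values; rows 0 and local_ny + 1 are halo rows,
// rows 1 .. local_ny are owned by this rank.
void exchange_halos(std::vector<double>& grid, int nx, int local_ny,
                    MPI_Comm comm) {
  int rank, size;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &size);
  const int up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
  const int down = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

  double* first_owned = &grid[1 * nx];
  double* last_owned  = &grid[local_ny * nx];
  double* top_halo    = &grid[0];
  double* bottom_halo = &grid[(local_ny + 1) * nx];

  // Send the first owned row up, receive into the bottom halo from below.
  MPI_Sendrecv(first_owned, nx, MPI_DOUBLE, up,   0,
               bottom_halo, nx, MPI_DOUBLE, down, 0,
               comm, MPI_STATUS_IGNORE);
  // Send the last owned row down, receive into the top halo from above.
  MPI_Sendrecv(last_owned,  nx, MPI_DOUBLE, down, 1,
               top_halo,    nx, MPI_DOUBLE, up,   1,
               comm, MPI_STATUS_IGNORE);
}
```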