
2014 First Workshop on Accelerator Programming using Directives: Latest Publications

Achieving Portability and Performance through OpenACC
Pub Date: 2014-11-16 DOI: 10.1109/WACCPD.2014.10
J. Herdman, W. Gaudin, O. Perks, D. Beckingsale, A. Mallinson, S. Jarvis
OpenACC is a directive-based programming model designed to allow easy access to emerging advanced-architecture systems for existing production codes based on Fortran, C and C++. It also provides an approach to coding contemporary technologies without the need to learn complex vendor-specific languages or to understand the hardware at the deepest level. Portability and performance are the key features of this programming model, and both are essential to productivity in real scientific applications. OpenACC support is provided by a number of vendors and is defined by an open standard. However, the standard is relatively new, and the implementations are relatively immature. This paper experimentally evaluates the currently available compilers by assessing two approaches to the OpenACC programming model: the "parallel" and "kernels" constructs. The implementations of both constructs are compared, for each vendor, showing performance differences of up to 84%. Additionally, we observe performance differences of up to 13% between the best vendor implementations. OpenACC features that appear to cause performance issues in certain compilers are identified and linked to differing default vector-length clauses between vendors. These studies are carried out over a range of hardware, including GPU-, APU-, Xeon- and Xeon Phi-based architectures. Finally, OpenACC performance and productivity are compared against the alternative native programming approaches on each targeted platform, including CUDA, OpenCL, OpenMP 4.0 and Intel Offload, in addition to MPI and OpenMP.
Citations: 24
XcalableACC: Extension of XcalableMP PGAS Language Using OpenACC for Accelerator Clusters
Pub Date: 2014-11-16 DOI: 10.1109/WACCPD.2014.6
M. Nakao, H. Murai, Takenori Shimosaka, Akihiro Tabuchi, T. Hanawa, Yuetsu Kodama, T. Boku, M. Sato
The present paper introduces the XcalableACC (XACC) programming model, a hybrid of the XcalableMP (XMP) Partitioned Global Address Space (PGAS) language and OpenACC. XACC defines directives that enable programmers to mix XMP and OpenACC directives in order to develop applications that can use accelerator clusters with ease. Moreover, to improve the performance of stencil applications, the Omni XACC compiler provides functions that can transfer a halo region in accelerator memory via Tightly Coupled Accelerators (TCA), a proprietary network for transferring data directly among accelerators. We evaluate the productivity and performance of XACC through implementations of the HIMENO Benchmark. The results show that, thanks to the productivity improvements, XACC requires less than half the source lines of code compared to a combination of the Message Passing Interface (MPI) and OpenACC, which is commonly used as a typical programming model. As a result of the performance improvements, XACC using TCA achieved up to 2.7 times faster performance than could be obtained via the combined OpenACC and MPI programming model using GPUDirect RDMA over InfiniBand.
Citations: 40
Accelerating a C++ CFD Code with OpenACC
Pub Date: 2014-11-16 DOI: 10.1109/WACCPD.2014.11
J. Kraus, Michael Schlottke, A. Adinetz, D. Pleiter
Today's HPC systems increasingly utilize accelerators to lower time to solution for their users and to reduce power consumption. To exploit the higher performance and energy efficiency of these accelerators, application developers need to rewrite at least parts of their codes. Taking the C++ flow solver ZFS as an example, we show that the directive-based programming model allows one to achieve good performance with reasonable effort, even for mature codes with many lines of code. Using OpenACC directives permitted us to incrementally accelerate ZFS, focusing on the parts of the program relevant to the problem at hand. Two new OpenACC 2.0 features, unstructured data regions and atomics, are required for this. OpenACC's interoperability with existing GPU libraries via the host_data use_device construct allowed us to use CUDA-aware MPI to achieve multi-GPU scalability comparable to the CPU version of ZFS. Like many other codes, the data structures of ZFS were designed with traditional CPUs and their relatively large private caches in mind. This leads to suboptimal memory access patterns on accelerators such as GPUs. We show how the texture cache on NVIDIA GPUs can be used to minimize the performance impact of these suboptimal patterns without writing platform-specific code. For the kernel most affected by the memory access pattern, we compare the initial array-of-structures memory layout with a structure-of-arrays layout.
Citations: 33
Accelerating Kirchhoff Migration on GPU Using Directives
Pub Date: 2014-11-16 DOI: 10.1109/WACCPD.2014.8
Rengan Xu, M. Hugues, H. Calandra, S. Chandrasekaran, B. Chapman
Accelerators offer the potential to significantly improve the performance of scientific applications by offloading compute-intensive portions of programs to the accelerators. However, effectively tapping their full potential is difficult owing to the programmability challenges users face when mapping computation algorithms to massively parallel architectures such as GPUs. Directive-based programming models offer programmers a way to rapidly create prototype applications by annotating regions of code for offloading with hints to the compiler. This is critical to improving productivity in production code. In this paper, we study the effectiveness of a high-level directive-based programming model, OpenACC, for parallelizing a seismic migration application called Kirchhoff Migration on GPU architectures. Kirchhoff Migration is a real-world production code in the oil and gas industry. Because of its compute-intensive nature, we focus on the computation part and explore different mechanisms to effectively harness the GPU's computation capabilities and memory hierarchy. We also analyze different loop transformation techniques in different OpenACC compilers and compare their performance. Compared to one socket (10 CPU cores) on the experimental platform, one GPU achieved maximum speedups of 20.54x and 6.72x for the interpolation and extrapolation kernel functions, respectively.
Citations: 8
OpenARC: Extensible OpenACC Compiler Framework for Directive-Based Accelerator Programming Study
Pub Date: 2014-11-16 DOI: 10.1109/WACCPD.2014.7
Seyong Lee, J. Vetter
Directive-based accelerator programming models such as OpenACC have arisen as an alternative solution for programming emerging Scalable Heterogeneous Computing (SHC) platforms. However, the increased complexity of SHC systems poses several challenges in terms of portability and productivity. This paper presents an open-source OpenACC compiler, called OpenARC, which serves as an extensible research framework for addressing those issues in directive-based accelerator programming. The paper explains the important design strategies and key compiler transformation techniques needed to implement the reference OpenACC compiler. Moreover, it demonstrates the efficacy of OpenARC as a research framework for directive-based programming study by proposing and implementing OpenACC extensions in the OpenARC framework to 1) support hybrid programming of unified memory and separate memory and 2) exploit architecture-specific features in an abstract manner. Porting thirteen standard OpenACC programs and three extended OpenACC programs to CUDA GPUs shows that OpenARC performs similarly to a commercial OpenACC compiler while serving as a high-level research framework.
Citations: 31
An OpenACC Extension for Data Layout Transformation
Pub Date: 2014-11-16 DOI: 10.1109/WACCPD.2014.12
Tetsuya Hoshino, N. Maruyama, S. Matsuoka
OpenACC is gaining momentum as an implicit and portable interface for porting legacy CPU-based applications to heterogeneous, highly parallel computational environments involving many-core accelerators such as GPUs and Intel Xeon Phi. OpenACC provides a set of loop directives, similar to OpenMP, for parallelization and for managing data movement, attaining functional portability across different heterogeneous devices. However, the performance portability of OpenACC is considered insufficient due to the characteristics of different target devices, especially regarding memory layouts, as automated adaptation by compilers is currently difficult. We are working to propose a set of directives that give compilers better semantic information for adaptation; here, we focus in particular on data layout, such as Structure of Arrays, an advantageous data structure for GPUs, as opposed to Array of Structures, which exhibits good performance on CPUs. We propose a directive extension to OpenACC that allows users to flexibly specify optimal layouts, even when the data structures are nested. Performance results show gains of as much as 96% for CPUs and 165% for GPUs compared to programs without such directives, essentially attaining both functional and performance portability in OpenACC.
Citations: 13
Directive-Based Parallelization of the NIM Weather Model for GPUs
Pub Date: 2014-11-16 DOI: 10.1109/WACCPD.2014.9
M. Govett, J. Middlecoff, T. Henderson
The NIM is a performance-portable model that runs on CPU, GPU and MIC architectures from a single source code. The single source plus efficient code design allows application scientists to maintain the Fortran code while computer scientists optimize performance and portability using OpenMP, OpenACC, and F2C-ACC directives. The F2C-ACC compiler was developed in 2008 at NOAA's Earth System Research Laboratory (ESRL) to support GPU parallelization before commercial Fortran GPU compilers were available. Since then, a number of vendors have built GPU compilers compliant with the emerging OpenACC standard. The paper compares the parallelization and performance of NIM using the F2C-ACC, Cray and PGI Fortran GPU compilers.
Citations: 22